Preface


This book is intended as a text covering the central concepts of practical optimization
techniques. It is designed for either self-study by professionals or classroom
work at the undergraduate or graduate level for students who have a technical background
in engineering, mathematics, or science. Like the field of optimization itself,
which involves many classical disciplines, the book should be useful to system analysts,
operations researchers, numerical analysts, management scientists, and other
specialists from the host of disciplines from which practical optimization applications
are drawn. The prerequisites for convenient use of the book are relatively
modest, the prime requirement being some familiarity with introductory elements
of linear algebra. Certain sections and developments do assume some knowledge
of more advanced concepts of linear algebra, such as eigenvector analysis, or some
background in sets of real numbers, but the text is structured so that the mainstream
of the development can be faithfully pursued without reliance on this more advanced
background material.

Although the book covers primarily material that is now fairly standard, this edition
emphasizes methods that are both state-of-the-art and popular. One major insight
is the connection between the purely analytical character of an optimization
problem, expressed perhaps by properties of the necessary conditions, and the behavior
of algorithms used to solve a problem. This was a major theme of the first
edition of this book and the fourth edition expands and further illustrates this relationship.

As in the earlier editions, the material in this fourth edition is organized into three
separate parts. Part I is a self-contained introduction to linear programming, a key
component of optimization theory. The presentation in this part is fairly conventional,
covering the main elements of the underlying theory of linear programming,
many of the most effective numerical algorithms, and many of its important special
applications. Part II, which is independent of Part I, covers the theory of unconstrained
optimization, including both derivations of the appropriate optimality conditions
and an introduction to basic algorithms. This part of the book explores the
general properties of algorithms and defines various notions of convergence. Part III
extends the concepts developed in the second part to constrained optimization
problems.  Except  for  a  few  isolated  sections,  this  part  is  also  independent  of  Part  I. 
It  is  possible  to  go  directly  into  Parts  II  and  III  omitting  Part  I,  and,  in  fact,  the 
book  has  been  used  in  this  way  in  many  universities.  Each  part  of  the  book  contains 
enough  material  to  form  the  basis  of  a  one-quarter  course.  In  either  classroom  use 
or  for  self-study,  it  is  important  not  to  overlook  the  suggested  exercises  at  the  end  of 
each  chapter.  The  selections  generally  include  exercises  of  a  computational  variety 
designed  to  test  one’s  understanding  of  a  particular  algorithm,  a  theoretical  variety 
designed  to  test  one’s  understanding  of  a  given  theoretical  development,  or  of  the 
variety  that  extends  the  presentation  of  the  chapter  to  new  applications  or  theoretical 
areas. One should attempt at least four or five exercises from each chapter. In progressing
through the book it would be unusual to read straight through from cover
to  cover.  Generally,  one  will  wish  to  skip  around.  In  order  to  facilitate  this  mode,  we 
have  indicated  sections  of  a  specialized  or  digressive  nature  with  an  asterisk*. 

New to this edition is a special Chap. 6 devoted to Conic Linear Programming, a
powerful generalization of Linear Programming. While the constraint set in a normal
linear program is defined by a finite number of linear inequalities of finite-dimensional
vector variables, the constraint set in conic linear programming may be
defined, for example, as a linear combination of symmetric positive semi-definite
matrices of a given dimension. Indeed, many conic structures are possible and useful
in a variety of applications. It must be recognized, however, that conic linear
programming is an advanced topic, requiring special study.

Another important topic is an accelerated steepest descent method that exhibits
superior convergence properties and, for this reason, has become quite popular.
Proofs of the convergence properties of both the standard and the accelerated steepest
descent methods are presented in Chap. 8.

As the field of optimization advances, it addresses greater complexity, treats
problems with ever more variables (as in Big Data situations), and ranges over diverse
applications. The field responds to these challenges by developing new algorithms,
building effective software, and expanding overall theory. An example of a valuable
new development is the work on big data problems. Surprisingly, coordinate
descent, with randomly selected coordinates at each step, is quite effective, as explained
in Chap. 8. As another example, some problems are formulated so that the
unknowns can be split into two subgroups, the constraints are linear, and the objective
function is separable with respect to the two groups of variables. The augmented
Lagrangian can be computed and it is natural to use an alternating series method.
We discuss the alternating direction method with multipliers as a dual method in
Chap. 14. Interestingly, this method is convergent when the number of partition
groups is two, but not necessarily for finer partitions.

We  wish  to  thank  the  many  students  and  researchers  who  over  the  years  have 
given  us  comments  concerning  the  book  and  those  who  encouraged  us  to  carry  out 
this  revision. 


Stanford,  CA,  USA 
January  2015 


D.G.  Luenberger 
Chapter 1

Introduction 


1.1  Optimization 

The concept of optimization is now well rooted as a principle underlying the analysis
of many complex decision or allocation problems. It offers a certain degree of philosophical
elegance that is hard to dispute, and it often offers an indispensable degree
of operational simplicity. Using this optimization philosophy, one approaches a
complex decision problem, involving the selection of values for a number of interrelated
variables, by focusing attention on a single objective designed to quantify
performance and measure the quality of the decision. This one objective is maximized
(or minimized, depending on the formulation) subject to the constraints that
may limit the selection of decision variable values. If a suitable single aspect of a
problem can be isolated and characterized by an objective, be it profit or loss in
a business setting, speed or distance in a physical problem, expected return in the
environment of risky investments, or social welfare in the context of government
planning, optimization may provide a suitable framework for analysis.

It  is,  of  course,  a  rare  situation  in  which  it  is  possible  to  fully  represent  all  the 
complexities  of  variable  interactions,  constraints,  and  appropriate  objectives  when 
faced  with  a  complex  decision  problem.  Thus,  as  with  all  quantitative  techniques 
of  analysis,  a  particular  optimization  formulation  should  be  regarded  only  as  an 
approximation.  Skill  in  modeling,  to  capture  the  essential  elements  of  a  problem, 
and  good  judgment  in  the  interpretation  of  results  are  required  to  obtain  meaningful 
conclusions.  Optimization,  then,  should  be  regarded  as  a  tool  of  conceptualization 
and  analysis  rather  than  as  a  principle  yielding  the  philosophically  correct  solution. 

Skill and good judgment, with respect to problem formulation and interpretation
of results, are enhanced through concrete practical experience and a thorough understanding
of relevant theory. Problem formulation itself always involves a tradeoff
between the conflicting objectives of building a mathematical model sufficiently
complex to accurately capture the problem description and building a model that is
tractable. The expert model builder is facile with both aspects of this tradeoff. One
aspiring to become such an expert must learn to identify and capture the important
issues of a problem mainly through example and experience; one must learn to
distinguish tractable models from intractable ones through a study of available
technique and theory and by nurturing the capability to extend existing theory to
new situations.

This book is centered around a certain optimization structure, that characteristic
of linear and nonlinear programming. Examples of situations leading to this structure
are sprinkled throughout the book, and these examples should help to indicate
how practical problems can often be fruitfully structured in this form. The book
mainly, however, is concerned with the development, analysis, and comparison of
algorithms for solving general subclasses of optimization problems. This is valuable
not only for the algorithms themselves, which enable one to solve given problems,
but also because identification of the collection of structures they most effectively
solve can enhance one’s ability to formulate problems.


1.2  Types  of  Problems 

The  content  of  this  book  is  divided  into  three  major  parts:  Linear  Programming, 
Unconstrained  Problems,  and  Constrained  Problems.  The  last  two  parts  together 
comprise  the  subject  of  nonlinear  programming. 


Linear  Programming 

Linear programming is without doubt the most natural mechanism for formulating
a vast array of problems with modest effort. A linear programming problem
is characterized, as the name implies, by linear functions of the unknowns; the
objective is linear in the unknowns, and the constraints are linear equalities or linear
inequalities in the unknowns. One familiar with other branches of linear mathematics
might suspect, initially, that linear programming formulations are popular
because the mathematics is nicer, the theory is richer, and the computation simpler
for linear problems than for nonlinear ones. But, in fact, these are not the primary
reasons. In terms of mathematical and computational properties, there are much
broader classes of optimization problems than linear programming problems that
have elegant and potent theories and for which effective algorithms are available.
It seems that the popularity of linear programming lies primarily with the formulation
phase of analysis rather than the solution phase, and for good cause. For
one thing, a great number of constraints and objectives that arise in practice are
indisputably linear. Thus, for example, if one formulates a problem with a budget
constraint restricting the total amount of money to be allocated between two different
commodities, the budget constraint takes the form $x_1 + x_2 \le B$, where $x_i$, $i = 1, 2$,
is the amount allocated to activity $i$, and $B$ is the budget. Similarly, if the objective
is, for example, maximum weight, then it can be expressed as $w_1 x_1 + w_2 x_2$, where
$w_i$, $i = 1, 2$, is the unit weight of the commodity $i$. The overall problem would
be expressed as


$$
\begin{aligned}
\text{maximize} \quad & w_1 x_1 + w_2 x_2 \\
\text{subject to} \quad & x_1 + x_2 \le B, \\
& x_1 \ge 0, \; x_2 \ge 0,
\end{aligned}
$$

which  is  an  elementary  linear  program.  The  linearity  of  the  budget  constraint  is 
extremely  natural  in  this  case  and  does  not  represent  simply  an  approximation  to  a 
more  general  functional  form. 
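
As a rough illustration, the elementary linear program above can be solved by comparing the objective at the three vertices of its triangular feasible region, since a linear objective over such a bounded region attains its maximum at a vertex. The function name `solve_budget_lp` and the sample weights and budget are assumptions made for this sketch; they do not come from the text.

```python
def solve_budget_lp(w1, w2, budget):
    """Maximize w1*x1 + w2*x2 subject to x1 + x2 <= budget, x1 >= 0, x2 >= 0.

    The feasible region is a triangle, so it suffices to compare the
    objective at its three vertices.
    """
    if budget < 0:
        raise ValueError("budget must be nonnegative for the problem to be feasible")
    vertices = [(0.0, 0.0), (float(budget), 0.0), (0.0, float(budget))]
    best = max(vertices, key=lambda v: w1 * v[0] + w2 * v[1])
    return best, w1 * best[0] + w2 * best[1]


# Hypothetical data: unit weights 3 and 5, budget 10.
x_opt, value = solve_budget_lp(3.0, 5.0, 10.0)
print(x_opt, value)
```

With these sample numbers the whole budget goes to the commodity with the larger unit weight. Vertex enumeration is practical only for tiny problems like this one.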

Another  reason  that  linear  forms  for  constraints  and  objectives  are  so  popular  in 
problem  formulation  is  that  they  are  often  the  least  difficult  to  define.  Thus,  even  if 
an  objective  function  is  not  purely  linear  by  virtue  of  its  inherent  definition  (as  in 
the  above  example),  it  is  often  far  easier  to  define  it  as  being  linear  than  to  decide  on 
some  other  functional  form  and  convince  others  that  the  more  complex  form  is  the 
best  possible  choice.  Linearity,  therefore,  by  virtue  of  its  simplicity,  often  is  selected 
as  the  easy  way  out  or,  when  seeking  generality,  as  the  only  functional  form  that  will 
be  equally  applicable  (or  nonapplicable)  in  a  class  of  similar  problems. 

Of course, the theoretical and computational aspects do take on a somewhat special
character for linear programming problems, the most significant development
being the simplex method. This algorithm is developed in Chaps. 2 and 3. More recent
interior point methods are nonlinear in character and these are developed in
Chap. 5.


Unconstrained  Problems 

It may seem that unconstrained optimization problems are so devoid of structural
properties as to preclude their applicability as useful models of meaningful problems.
Quite the contrary is true for two reasons. First, it can be argued, quite convincingly,
that if the scope of a problem is broadened to the consideration of all relevant decision
variables, there may then be no constraints; or, put another way, constraints
represent artificial delimitations of scope, and when the scope is broadened the constraints
vanish. Thus, for example, it may be argued that a budget constraint is not
characteristic of a meaningful problem formulation: since by borrowing at some
interest rate it is always possible to obtain additional funds, a term reflecting the
cost of funds should be incorporated into the objective rather than introducing a
budget constraint. A similar argument applies to constraints describing the
availability of other resources which at some cost (however great) could be supplemented.

The second reason that many important problems can be regarded as having
no constraints is that constrained problems are sometimes easily converted to
unconstrained problems. For instance, the sole effect of equality constraints is simply
to limit the degrees of freedom, by essentially making some variables functions
of others. These dependencies can sometimes be explicitly characterized, and a new
problem having its number of variables equal to the true degree of freedom can be
determined. As a simple specific example, a constraint of the form $x_1 + x_2 = B$ can
be eliminated by substituting $x_2 = B - x_1$ everywhere else that $x_2$ appears in the
problem.
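
The substitution just described can be sketched in a few lines. The objective $f(x_1, x_2) = x_1^2 + 2x_2^2$, the value $B = 3$, and the search interval are hypothetical choices made only for this illustration. The one-variable problem is minimized here by golden-section search, a technique picked for brevity rather than one prescribed by the text.

```python
import math


def golden_section_minimize(phi, lo, hi, tol=1e-8):
    """Minimize a unimodal one-variable function phi on the interval [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - ratio * (b - a)
    d = a + ratio * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):
            # The minimizer lies in [a, d]; reuse c as the new right probe.
            b, d = d, c
            c = b - ratio * (b - a)
        else:
            # The minimizer lies in [c, b]; reuse d as the new left probe.
            a, c = c, d
            d = a + ratio * (b - a)
    return 0.5 * (a + b)


def f(x1, x2):
    """Hypothetical objective in two variables."""
    return x1 ** 2 + 2.0 * x2 ** 2


B = 3.0


def reduced(x1):
    """Objective after eliminating x2 through the constraint x1 + x2 = B."""
    return f(x1, B - x1)


x1 = golden_section_minimize(reduced, -10.0, 10.0)
x2 = B - x1
print(x1, x2)
```

Setting the derivative of the reduced objective to zero gives $x_1 = 2B/3$ and $x_2 = B/3$, which the numerical search should approximate.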

Aside from representing a significant class of practical problems, the study of unconstrained
problems, of course, provides a stepping stone toward the more general
case of constrained problems. Many aspects of both theory and algorithms are most
naturally motivated and verified for the unconstrained case before progressing to the
constrained case.


Constrained  Problems 

In  spite  of  the  arguments  given  above,  many  problems  met  in  practice  are  formulated 
as  constrained  problems.  This  is  because  in  most  instances  a  complex  problem  such 
as,  for  example,  the  detailed  production  policy  of  a  giant  corporation,  the  planning 
of  a  large  government  agency,  or  even  the  design  of  a  complex  device  cannot  be 
directly  treated  in  its  entirety  accounting  for  all  possible  choices,  but  instead  must  be 
decomposed  into  separate  subproblems — each  subproblem  having  constraints  that 
are  imposed  to  restrict  its  scope.  Thus,  in  a  planning  problem,  budget  constraints  are 
commonly  imposed  in  order  to  decouple  that  one  problem  from  a  more  global  one. 
Therefore,  one  frequently  encounters  general  nonlinear  constrained  mathematical 
programming  problems. 

The  general  mathematical  programming  problem  can  be  stated  as 

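$$
\begin{aligned}
\text{minimize} \quad & f(\mathbf{x}) \\
\text{subject to} \quad & h_i(\mathbf{x}) = 0, \quad i = 1, 2, \ldots, m \\
& g_j(\mathbf{x}) \le 0, \quad j = 1, 2, \ldots, p \\
& \mathbf{x} \in S.
\end{aligned}
$$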

In this formulation, $\mathbf{x}$ is an $n$-dimensional vector of unknowns, $\mathbf{x} = (x_1, x_2, \ldots, x_n)$,
and $f$, $h_i$, $i = 1, 2, \ldots, m$, and $g_j$, $j = 1, 2, \ldots, p$, are real-valued functions of
the variables $x_1, x_2, \ldots, x_n$. The set $S$ is a subset of $n$-dimensional space. The function
$f$ is the objective function of the problem and the equations, inequalities, and
set restrictions are constraints.
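
To make the roles of the equality constraints, inequality constraints, and set restriction concrete, the following sketch checks whether a candidate point satisfies all three. The function name `is_feasible`, the tolerance, and the sample constraint functions are assumptions introduced for this example.

```python
def is_feasible(x, equalities=(), inequalities=(), in_set=lambda x: True, tol=1e-9):
    """Return True if h_i(x) = 0 for all i, g_j(x) <= 0 for all j, and x is in S.

    Equalities and inequalities are checked up to the tolerance tol.
    """
    return (
        all(abs(h(x)) <= tol for h in equalities)
        and all(g(x) <= tol for g in inequalities)
        and in_set(x)
    )


# Hypothetical instance with n = 2, m = 1, p = 1, and S the entire plane.
h = [lambda x: x[0] + x[1] - 2.0]   # h_1(x) = x1 + x2 - 2 = 0
g = [lambda x: -x[0]]               # g_1(x) = -x1 <= 0

print(is_feasible((0.5, 1.5), h, g))    # satisfies both constraints
print(is_feasible((-1.0, 3.0), h, g))   # violates g_1
```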

Generally, in this book, additional assumptions are introduced in order to make
the problem smooth in some suitable sense. For example, the functions in the problem
are usually required to be continuous, or perhaps to have continuous derivatives.
This ensures that small changes in x lead to small changes in other values associated
with the problem. Also, the set S is not allowed to be arbitrary but usually is
required to be a connected region of n-dimensional space, rather than, for example,
a set of distinct isolated points. This ensures that small changes in x can be made.
Indeed, in a majority of problems treated, the set S is taken to be the entire space;
there is no set restriction.

In  view  of  these  smoothness  assumptions,  one  might  characterize  the  problems 
treated in this book as continuous variable programming, since we generally discuss
problems  where  all  variables  and  function  values  can  be  varied  continuously.  In  fact, 
this  assumption  forms  the  basis  of  many  of  the  algorithms  discussed,  which  operate 
essentially  by  making  a  series  of  small  movements  in  the  unknown  x  vector. 


1.3  Size  of  Problems 

One  obvious  measure  of  the  complexity  of  a  programming  problem  is  its  size, 
measured  in  terms  of  the  number  of  unknown  variables  or  the  number  of  constraints. 
As  might  be  expected,  the  size  of  problems  that  can  be  effectively  solved  has  been 
increasing  with  advancing  computing  technology  and  with  advancing  theory.  Today, 
with  present  computing  capabilities,  however,  it  is  reasonable  to  distinguish  three 
classes  of  problems:  small-scale  problems  having  about  five  or  fewer  unknowns 
and constraints; intermediate-scale problems having from about five to a hundred
or  a  thousand  variables;  and  large-scale  problems  having  perhaps  thousands  or  even 
millions  of  variables  and  constraints.  This  classification  is  not  entirely  rigid,  but 
it  reflects  at  least  roughly  not  only  size  but  the  basic  differences  in  approach  that 
accompany  different  size  problems.  As  a  rough  rule,  small-scale  problems  can  be 
solved  by  hand  or  by  a  small  computer.  Intermediate-scale  problems  can  be  solved 
on  a  personal  computer  with  general  purpose  mathematical  programming  codes. 
Large-scale  problems  require  sophisticated  codes  that  exploit  special  structure  and 
usually  require  large  computers. 

Much of the basic theory associated with optimization, particularly in nonlinear
programming, is directed at obtaining necessary and sufficient conditions
satisfied by a solution point, rather than at questions of computation. This theory
involves mainly the study of Lagrange multipliers, including the Karush-Kuhn-Tucker
Theorem and its extensions. It tremendously enhances insight into the philosophy
of constrained optimization and provides satisfactory basic foundations for
other important disciplines, such as the theory of the firm, consumer economics,
and optimal control theory. The interpretation of Lagrange multipliers that accompanies
this theory is valuable in virtually every optimization setting. As a basis for
computing numerical solutions to optimization problems, however, this theory is far
from adequate, since it does not consider the difficulties associated with solving the
equations resulting from the necessary conditions.

If it is acknowledged from the outset that a given problem is too large and too
complex to be efficiently solved by hand (and hence it is acknowledged that a
computer solution is desirable), then one’s theory should be directed toward development
of procedures that exploit the efficiencies of computers. In most cases this
leads to the abandonment of the idea of solving the set of necessary conditions in
favor of the more direct procedure of searching through the space (in an intelligent
manner) for ever-improving points.

Today, search techniques can be effectively applied to more or less general nonlinear
programming problems. Problems of great size, large-scale programming
problems, can be solved if they possess special structural characteristics, especially
sparsity, that can be exploited by a solution method. Today linear programming software
packages are capable of automatically identifying sparse structure within the
input data and taking advantage of this sparsity in numerical computation. It is now
not uncommon to solve linear programs of up to a million variables and constraints,
as long as the structure is sparse. Problem-dependent methods, where the structure
is not automatically identified, are largely directed to transportation and network
flow problems as discussed in the book.
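
One simple way to see why sparsity helps is to store only the nonzero coefficients of a constraint matrix, so that the work in a matrix-vector product grows with the number of nonzeros rather than with the full matrix size. The dictionary-per-row layout below is an assumption chosen for brevity in this sketch; it is not the storage scheme of any particular software package.

```python
def sparse_matvec(rows, x):
    """Multiply a sparse matrix by a dense vector.

    rows[i] is a dict mapping a column index to the nonzero coefficient in
    row i, so the work is proportional to the number of nonzeros.
    """
    return [sum(a * x[j] for j, a in row.items()) for row in rows]


# Hypothetical 3 x 5 constraint matrix with only 5 nonzero entries.
A_rows = [{0: 1.0, 3: 2.0}, {1: -1.0}, {2: 4.0, 4: 1.0}]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(sparse_matvec(A_rows, x))
```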

This book focuses on the aspects of general theory that are most fruitful for
computation in the widest class of problems. While necessary and sufficient conditions
are examined and their application to small-scale problems is illustrated, our
primary interest in such conditions is in their role as the core of a broader theory
applicable to the solution of larger problems. At the other extreme, although some
instances of structure exploitation are discussed, we focus primarily on the general
continuous variable programming problem rather than on special techniques for special
structures.


1.4  Iterative  Algorithms  and  Convergence 

The most important characteristic of a high-speed computer is its ability to perform
repetitive operations efficiently, and in order to exploit this basic characteristic,
most algorithms designed to solve large optimization problems are iterative
in nature. Typically, in seeking a vector that solves the programming problem, an
initial vector $\mathbf{x}_0$ is selected and the algorithm generates an improved vector $\mathbf{x}_1$. The
process is repeated and a still better solution $\mathbf{x}_2$ is found. Continuing in this fashion,
a sequence of ever-improving points $\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_k, \ldots$ is found that approaches a
solution point $\mathbf{x}^*$. For linear programming problems solved by the simplex method,
the generated sequence is of finite length, reaching the solution point exactly after a
finite (although initially unspecified) number of steps. For nonlinear programming
problems or interior-point methods, the sequence generally does not ever exactly
reach the solution point, but converges toward it. In operation, the process is terminated
when a point sufficiently close to the solution point, for practical purposes, is
obtained.
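
The iterate-and-terminate pattern described above can be sketched with a fixed-step gradient iteration. This particular method, the step size, the stopping tolerance, and the quadratic test objective are all assumptions chosen for illustration; the text has not yet introduced any specific algorithm.

```python
def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iter=10_000):
    """Generate x_0, x_1, ... by x_{k+1} = x_k - step * grad(x_k).

    Stops when the gradient norm falls below tol, that is, when the current
    point is close enough to a stationary point for practical purposes.
    Returns the final point and the number of steps taken.
    """
    x = list(x0)
    for k in range(max_iter):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            return x, k
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x, max_iter


# Hypothetical objective f(x) = (x1 - 1)^2 + 2 * (x2 + 3)^2, minimized at (1, -3).
def grad_f(x):
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)]


x_final, iterations = gradient_descent(grad_f, [0.0, 0.0])
print(x_final, iterations)
```

The sequence never lands exactly on the minimizer; the loop simply stops once the gradient is small enough.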

The theory of iterative algorithms can be divided into three (somewhat overlapping)
aspects. The first is concerned with the creation of the algorithms themselves.
Algorithms are not conceived arbitrarily, but are based on a creative examination
of the programming problem, its inherent structure, and the efficiencies of digital
computers. The second aspect is the verification that a given algorithm will in fact
generate a sequence that converges to a solution point. This aspect is referred to as
global convergence analysis, since it addresses the important question of whether
the algorithm, when initiated far from the solution point, will eventually converge
to it. The third aspect is referred to as local convergence analysis or complexity
analysis and is concerned with the rate at which the generated sequence of points
converges to the solution. One cannot regard a problem as solved simply because
an algorithm is known which will converge to the solution, since it may require
an exorbitant amount of time to reduce the error to an acceptable tolerance. It is
essential when prescribing algorithms that some estimate of the time required be
available. It is the convergence-rate aspect of the theory that allows some quantitative
evaluation and comparison of different algorithms, and at least crudely, assigns
a measure of tractability to a problem, as discussed in Sect. 1.1.

A modern-day technical version of Confucius’ most famous saying, and one
which represents an underlying philosophy of this book, might be, “One good theory
is worth a thousand computer runs.” Thus, the convergence properties of an iterative
algorithm can be estimated with confidence either by performing numerous
computer experiments on different problems or by a simple well-directed theoretical
analysis. A simple theory, of course, provides invaluable insight as well as the
desired estimate.

For linear programming using the simplex method, solid theoretical statements
on the speed of convergence were elusive, because the method actually converges to
an exact solution in a finite number of steps. The question is how many steps might
be required. This question was finally resolved when it was shown that it was possible
for the number of steps to be exponential in the size of the program. The situation
is different for interior point algorithms, which essentially treat the problem by
introducing nonlinear terms, and which therefore do not generally obtain a solution
in a finite number of steps but instead converge toward a solution.

For nonlinear programs, including interior point methods applied to linear programs,
it is meaningful to consider the speed of convergence. There are many
different classes of nonlinear programming algorithms, each with its own convergence
characteristics. However, in many cases the convergence properties can be
deduced analytically by fairly simple means, and this analysis is substantiated by
computational experience. Presentation of convergence analysis, which seems to be
the natural focal point of a theory directed at obtaining specific answers, is a unique
feature of this book.

There are in fact two aspects of convergence-rate theory. The first is generally
known as complexity analysis and focuses on how fast the method converges overall,
distinguishing between polynomial-time algorithms and non-polynomial-time
algorithms. The second aspect provides more detailed analysis of how fast the
method converges in the final stages, and can provide comparisons between different
algorithms. Both of these are treated in this book.
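
A simple empirical counterpart to the second aspect is to look at the ratio of successive errors in the final stages of an iteration. A roughly constant ratio below one suggests geometric convergence at that rate. The helper name `error_ratios` and the scalar iteration used as test data are hypothetical and serve only to illustrate the idea.

```python
def error_ratios(errors):
    """Return the ratios e_{k+1} / e_k of successive errors.

    A roughly constant ratio below 1 indicates geometric convergence
    at approximately that rate.
    """
    return [e_next / e for e, e_next in zip(errors, errors[1:]) if e > 0]


# Hypothetical iteration x_{k+1} = 0.5 * x_k + 1, which converges to x* = 2.
x, x_star = 10.0, 2.0
errors = []
for _ in range(8):
    errors.append(abs(x - x_star))
    x = 0.5 * x + 1.0

ratios = error_ratios(errors)
print(ratios)
```

For this iteration the error is halved at every step, so each ratio is 0.5.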

The convergence-rate theory presented has two somewhat surprising but definitely
pleasing aspects. First, the theory is, for the most part, extremely simple in nature.
Although initially one might fear that a theory aimed at predicting the speed of
convergence of a complex algorithm might itself be doubly complex, in fact the
associated convergence analysis often turns out to be exceedingly elementary, requiring
only a line or two of calculation. Second, a large class of seemingly distinct
algorithms turns out to have a common convergence rate. Indeed, as emphasized
in the later chapters of the book, there is a canonical rate associated with a given
programming problem that seems to govern the speed of convergence of many algorithms
when applied to that problem. It is this fact that underlies the potency of the
theory, allowing definitive comparisons among algorithms to be made even without
detailed knowledge of the problems to which they will be applied. Together these
two properties, simplicity and potency, assure convergence analysis a permanent
position of major importance in mathematical programming theory.


Part  I 

Linear  Programming 


Chapter  2 

Basic  Properties  of  Linear  Programs 


2.1  Introduction 

A linear program (LP) is an optimization problem in which the objective function
is linear in the unknowns and the constraints consist of linear equalities and linear
inequalities. The exact form of these constraints may differ from one problem to another,
but as shown below, any linear program can be transformed into the following
standard form:


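$$
\begin{aligned}
\text{minimize} \quad & c_1 x_1 + c_2 x_2 + \cdots + c_n x_n \\
\text{subject to} \quad & a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n = b_1 \\
& a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n = b_2 \\
& \qquad \vdots \\
& a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n = b_m \\
\text{and} \quad & x_1 \ge 0, \; x_2 \ge 0, \; \ldots, \; x_n \ge 0,
\end{aligned}
\tag{2.1}
$$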


where  the  hf  s,  c\  s  and  atf  s  are  fixed  real  constants,  and  the  */’s  are  real  numbers  to 
be  determined.  We  always  assume  that  each  equation  has  been  multiplied  by  minus 
unity,  if  necessary,  so  that  each  >0. 

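In more compact vector notation, this standard problem becomes

$$
\begin{array}{ll}
\text{minimize} & \mathbf{c}^T\mathbf{x} \\
\text{subject to} & \mathbf{A}\mathbf{x} = \mathbf{b} \ \text{ and } \ \mathbf{x} \ge \mathbf{0}.
\end{array}
\tag{2.2}
$$

Here $\mathbf{x}$ is an n-dimensional column vector, $\mathbf{c}^T$ is an n-dimensional row vector, $\mathbf{A}$ is
an m × n matrix, and $\mathbf{b}$ is an m-dimensional column vector. The vector inequality
$\mathbf{x} \ge \mathbf{0}$ means that each component of $\mathbf{x}$ is nonnegative.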


1  See  Appendix  A  for  a  description  of  the  vector  notation  used  throughout  this  book. 


Before giving some examples of areas in which linear programming problems
arise naturally, we indicate how various other forms of linear programs can be
converted to the standard form.

Example 1 (Slack Variables). Consider the problem

$$
\begin{array}{ll}
\text{minimize} & c_1x_1 + c_2x_2 + \cdots + c_nx_n \\
\text{subject to} & a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n \le b_1 \\
 & a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n \le b_2 \\
 & \qquad\vdots \\
 & a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n \le b_m \\
\text{and} & x_1 \ge 0,\ x_2 \ge 0,\ \ldots,\ x_n \ge 0.
\end{array}
$$

In this case the constraint set is determined entirely by linear inequalities.
The problem may be alternatively expressed as

$$
\begin{array}{ll}
\text{minimize} & c_1x_1 + c_2x_2 + \cdots + c_nx_n \\
\text{subject to} & a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n + y_1 = b_1 \\
 & a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n + y_2 = b_2 \\
 & \qquad\vdots \\
 & a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n + y_m = b_m \\
\text{and} & x_1 \ge 0,\ x_2 \ge 0,\ \ldots,\ x_n \ge 0, \\
\text{and} & y_1 \ge 0,\ y_2 \ge 0,\ \ldots,\ y_m \ge 0.
\end{array}
$$

The new nonnegative variables $y_i$ introduced to convert the inequalities to equalities
are called slack variables (or more loosely, slacks). By considering the problem
as one having n + m unknowns $x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_m$, the problem takes
the standard form. The m × (n + m) matrix that now describes the linear equality
constraints is of the special form [A, I] (that is, its columns can be partitioned into
two sets; the first n columns make up the original A matrix and the last m columns
make up an m × m identity matrix).

Example 2 (Surplus Variables). If the linear inequalities of Example 1 are reversed
so that a typical inequality is

$$a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{in}x_n \ge b_i,$$

it is clear that this is equivalent to

$$a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{in}x_n - y_i = b_i$$

with $y_i \ge 0$. Variables, such as $y_i$, adjoined in this fashion to convert a "greater than
or equal to" inequality to equality are called surplus variables.

It  should  be  clear  that  by  suitably  multiplying  by  minus  unity,  and  adjoining  slack 
and  surplus  variables,  any  set  of  linear  inequalities  can  be  converted  to  standard  form 
if  the  unknown  variables  are  restricted  to  be  nonnegative. 
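
The conversion just described can be sketched in a few lines of Python. This is only an illustration under our own conventions: the function name `to_standard_form`, the string codes `"<="`, `">="`, `"="` for the sense of each row, and the small data set are assumptions made for the example, not notation from the text. The sketch appends one column per inequality (+1 for a slack, -1 for a surplus) and then multiplies a row by minus unity whenever its right-hand side is negative.

```python
def to_standard_form(A, b, senses):
    """Convert rows a_i . x (<=, >=, =) b_i, with x >= 0, to equalities.

    One extra column is appended per inequality: +1 for a slack variable
    (<=) and -1 for a surplus variable (>=). Rows are then multiplied by
    -1 where needed so that every right-hand side is nonnegative.
    """
    n_extra = sum(1 for s in senses if s != "=")
    A_std, b_std = [], []
    k = 0  # index of the next slack/surplus column
    for row, rhs, sense in zip(A, b, senses):
        extra = [0] * n_extra
        if sense == "<=":
            extra[k] = 1
            k += 1
        elif sense == ">=":
            extra[k] = -1
            k += 1
        new_row = list(row) + extra
        if rhs < 0:  # multiply the equation by minus unity
            new_row = [-v for v in new_row]
            rhs = -rhs
        A_std.append(new_row)
        b_std.append(rhs)
    return A_std, b_std


# Hypothetical data: x1 + 2x2 <= 4,  3x1 + x2 >= 6,  x1 - x2 <= -2.
A_std, b_std = to_standard_form(
    A=[[1, 2], [3, 1], [1, -1]],
    b=[4, 6, -2],
    senses=["<=", ">=", "<="],
)
for row, rhs in zip(A_std, b_std):
    print(row, "=", rhs)
```

When every row is a "less than or equal to" inequality with nonnegative right-hand side, the appended columns form the identity block of the matrix [A, I] described in Example 1.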
Example 3 (Free Variables, First Method). If a linear program is given in standard
form except that one or more of the unknown variables is not required to be
nonnegative, the problem can be transformed to standard form by either of two simple
techniques.

To describe the first technique, suppose in (2.1), for example, that the restriction
$x_1 \ge 0$ is not present and hence $x_1$ is free to take on either positive or negative
values. We then write

$$x_1 = u_1 - v_1, \tag{2.3}$$

where we require $u_1 \ge 0$ and $v_1 \ge 0$. If we substitute $u_1 - v_1$ for $x_1$ everywhere in
(2.1), the linearity of the constraints is preserved and all variables are now required
to be nonnegative. The problem is then expressed in terms of the n + 1 variables
$u_1, v_1, x_2, x_3, \ldots, x_n$.

There is obviously a certain degree of redundancy introduced by this technique,
however, since a constant added to $u_1$ and $v_1$ does not change $x_1$ (that is, the
representation of a given value $x_1$ is not unique). Nevertheless, this does not hinder
the simplex method of solution.

Example 4 (Free Variables, Second Method). A second approach for converting to
standard form when $x_1$ is unconstrained in sign is to eliminate $x_1$ together with one
of the constraint equations. Take any one of the m equations in (2.1) which has a
nonzero coefficient for $x_1$. Say, for example,

$$a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{in}x_n = b_i, \tag{2.4}$$

where $a_{i1} \ne 0$. Then $x_1$ can be expressed as a linear combination of the other
variables plus a constant. If this expression is substituted for $x_1$ everywhere in (2.1),
we are led to a new problem of exactly the same form but expressed in terms of
the variables $x_2, x_3, \ldots, x_n$ only. Furthermore, the ith equation, used to determine
$x_1$, is now identically zero and it too can be eliminated. This substitution scheme
is valid since any combination of nonnegative variables $x_2, x_3, \ldots, x_n$ leads to a
feasible $x_1$ from (2.4), since the sign of $x_1$ is unrestricted. As a result of this
simplification, we obtain a standard linear program having n - 1 variables and m - 1
constraint equations. The value of the variable $x_1$ can be determined after solution
through (2.4).

Example 5 (Specific Case). As a specific instance of the above technique consider
the problem

$$
\begin{array}{ll}
\text{minimize} & x_1 + 3x_2 + 4x_3 \\
\text{subject to} & x_1 + 2x_2 + x_3 = 5 \\
 & 2x_1 + 3x_2 + x_3 = 6 \\
 & x_2 \ge 0,\ x_3 \ge 0.
\end{array}
$$

Since $x_1$ is free, we solve for it from the first constraint, obtaining

$$x_1 = 5 - 2x_2 - x_3. \tag{2.5}$$

Substituting this into the objective and the second constraint, we obtain the
equivalent problem (subtracting five from the objective)

$$
\begin{array}{ll}
\text{minimize} & x_2 + 3x_3 \\
\text{subject to} & x_2 + x_3 = 4 \\
 & x_2 \ge 0,\ x_3 \ge 0,
\end{array}
$$

which is a problem in standard form. After the smaller problem is solved (the answer
is $x_2 = 4$, $x_3 = 0$) the value for $x_1$ ($x_1 = -3$) can be found from (2.5).
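
The substitution in Example 5 is easy to check numerically. The following sketch, whose helper names are ours and purely illustrative, samples points on the reduced feasible segment $x_2 + x_3 = 4$, recovers $x_1$ from (2.5), and confirms that the second original constraint holds automatically and that the two objectives differ by the constant 5. Sampling is used only as a check; on this segment the reduced objective equals $4 + 2x_3$, so the minimum is at $x_3 = 0$.

```python
from fractions import Fraction


def x1_from(x2, x3):
    # Equation (2.5): x1 = 5 - 2*x2 - x3
    return 5 - 2 * x2 - x3


def original_objective(x1, x2, x3):
    return x1 + 3 * x2 + 4 * x3


def reduced_objective(x2, x3):
    return x2 + 3 * x3


best = None
for k in range(0, 41):  # sample x3 = 0, 1/10, ..., 4 exactly
    x3 = Fraction(k, 10)
    x2 = 4 - x3  # stay on the reduced constraint x2 + x3 = 4
    x1 = x1_from(x2, x3)
    # The second original constraint must hold automatically.
    assert 2 * x1 + 3 * x2 + x3 == 6
    # The two objectives differ by the constant 5.
    assert original_objective(x1, x2, x3) == reduced_objective(x2, x3) + 5
    candidate = (reduced_objective(x2, x3), x1, x2, x3)
    if best is None or candidate < best:
        best = candidate

print(best)  # (reduced objective, x1, x2, x3) at the best sampled point
```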


2.2  Examples  of  Linear  Programming  Problems 

Linear programming has long proved its merit as a significant model of numerous
allocation problems and economic phenomena. The continuously expanding
literature of applications repeatedly demonstrates the importance of linear programming
as a general framework for problem formulation. In this section we present some
classic examples of situations that have natural formulations.

Example 1 (The Diet Problem). How can we determine the most economical diet
that satisfies the basic minimum nutritional requirements for good health? Such a
problem might, for example, be faced by the dietitian of a large army. We assume
that there are available at the market n different foods and that the jth food sells at a
price $c_j$ per unit. In addition there are m basic nutritional ingredients and, to achieve
a balanced diet, each individual must receive at least $b_i$ units of the ith nutrient per
day. Finally, we assume that each unit of food j contains $a_{ij}$ units of the ith nutrient.

If we denote by $x_j$ the number of units of food j in the diet, the problem then is
to select the $x_j$'s to minimize the total cost

$$c_1x_1 + c_2x_2 + \cdots + c_nx_n$$

subject to the nutritional constraints

$$a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{in}x_n \ge b_i, \quad i = 1, \ldots, m,$$

and the nonnegativity constraints

$$x_1 \ge 0,\ x_2 \ge 0,\ \ldots,\ x_n \ge 0$$

on the food quantities.

This problem can be converted to standard form by subtracting a nonnegative
surplus variable from the left side of each of the m linear inequalities. The diet
problem is discussed further in Chap. 4.

Example 2 (Manufacturing Problem). Suppose we own a facility that is capable of
manufacturing n different products, each of which may require various amounts
of m different resources. Each product can be produced at any level $x_j \ge 0$,
j = 1, 2, ..., n, and each unit of the jth product can sell for $p_j$ dollars and needs
$a_{ij}$ units of the ith resource, i = 1, 2, ..., m. Assuming linearity of the production
facility, if we are given a set of m numbers $b_1, b_2, \ldots, b_m$ describing the available
quantities of the m resources, and we wish to manufacture products at maximum
revenue, our decision problem is a linear program to maximize

$$p_1x_1 + p_2x_2 + \cdots + p_nx_n$$

subject to the resource constraints

$$a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{in}x_n \le b_i, \quad i = 1, \ldots, m,$$

and the nonnegativity constraints on all production variables.

Example 3 (The Transportation Problem). Quantities $a_1, a_2, \ldots, a_m$, respectively,
of a certain product are to be shipped from each of m locations and received in
amounts $b_1, b_2, \ldots, b_n$, respectively, at each of n destinations. Associated with the
shipping of a unit of product from origin i to destination j is a shipping cost $c_{ij}$. It is
desired to determine the amounts $x_{ij}$ to be shipped between each origin-destination
pair i = 1, 2, ..., m; j = 1, 2, ..., n; so as to satisfy the shipping requirements and
minimize the total cost of transportation.

To formulate this problem as a linear programming problem, we set up the array
shown below:

$$
\begin{array}{cccc|c}
x_{11} & x_{12} & \cdots & x_{1n} & a_1 \\
x_{21} & x_{22} & \cdots & x_{2n} & a_2 \\
\vdots & \vdots &        & \vdots & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{mn} & a_m \\
\hline
b_1 & b_2 & \cdots & b_n &
\end{array}
$$

The ith row in this array defines the variables associated with the ith origin, while
the jth column in this array defines the variables associated with the jth destination.
The problem is to place nonnegative variables $x_{ij}$ in this array so that the sum
across the ith row is $a_i$, the sum down the jth column is $b_j$, and the weighted sum
$\sum_{j=1}^{n}\sum_{i=1}^{m} c_{ij}x_{ij}$, representing the transportation cost, is minimized.

Thus, we have the linear programming problem:

$$
\begin{array}{lll}
\text{minimize} & \sum_{ij} c_{ij}x_{ij} & \\
\text{subject to} & \sum_{j=1}^{n} x_{ij} = a_i & \text{for } i = 1, 2, \ldots, m \qquad (2.6) \\
 & \sum_{i=1}^{m} x_{ij} = b_j & \text{for } j = 1, 2, \ldots, n \qquad (2.7) \\
 & x_{ij} \ge 0 & \text{for } i = 1, 2, \ldots, m;\ j = 1, 2, \ldots, n.
\end{array}
$$

In order that the constraints (2.6) and (2.7) be consistent, we must, of course,
assume that $\sum_{i=1}^{m} a_i = \sum_{j=1}^{n} b_j$, which corresponds to assuming that the total amount
shipped is equal to the total amount received.

The transportation problem is now clearly seen to be a linear programming
problem in mn variables. The equations (2.6) and (2.7) can be combined and expressed
in matrix form in the usual manner and this results in an (m + n) × (mn) coefficient
matrix consisting of zeros and ones only.
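
As a small illustration of this structure, the sketch below builds the (m + n) × (mn) coefficient matrix for given m and n. The function name and the column ordering (variable $x_{ij}$ stored at zero-based column i·n + j, with i and j also counted from zero) are our own assumptions for the example.

```python
def transportation_matrix(m, n):
    """Coefficient matrix of (2.6)-(2.7); x_ij is stored at column i*n + j."""
    rows = []
    for i in range(m):  # origin rows: sum over j of x_ij = a_i
        rows.append([1 if col // n == i else 0 for col in range(m * n)])
    for j in range(n):  # destination rows: sum over i of x_ij = b_j
        rows.append([1 if col % n == j else 0 for col in range(m * n)])
    return rows


M = transportation_matrix(2, 3)
for row in M:
    print(row)
```

Each column of the resulting matrix contains exactly two ones, one in an origin row and one in a destination row, which reflects the fact that each $x_{ij}$ appears in exactly one constraint of each type.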


Fig.  2.1  A  network  with  capacities 


Example 4 (The Maximal Flow Problem). Consider a capacitated network (see
Fig. 2.1, and Appendix D) in which two special nodes, called the source and the
sink, are distinguished. Say they are nodes 1 and m, respectively. All other nodes
must satisfy the strict conservation requirement; that is, the net flow into these nodes
must be zero. However, the source may have a net outflow and the sink a net inflow.
The outflow f of the source will equal the inflow of the sink as a consequence of
the conservation at all other nodes. A set of arc flows satisfying these conditions
is said to be a flow in the network of value f. The maximal flow problem is that
of determining the maximal flow that can be established in such a network. When
written out, it takes the form

$$
\begin{array}{ll}
\text{maximize} & f \\
\text{subject to} & \sum_{j=1}^{n} x_{1j} - \sum_{j=1}^{n} x_{j1} - f = 0 \\
 & \sum_{j=1}^{n} x_{ij} - \sum_{j=1}^{n} x_{ji} = 0, \quad i \ne 1, m \\
 & \sum_{j=1}^{n} x_{mj} - \sum_{j=1}^{n} x_{jm} + f = 0 \\
 & 0 \le x_{ij} \le k_{ij}, \quad \text{for all } i, j,
\end{array}
\tag{2.8}
$$

where $k_{ij} = 0$ for those no-arc pairs (i, j).
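
To make the constraints of (2.8) concrete, the following sketch checks a proposed set of arc flows against the capacity and conservation requirements and returns the flow value f. It does not solve the maximal flow problem; it only verifies feasibility. The function name, the zero-based node numbering, and the four-node network are hypothetical choices made for the illustration.

```python
def flow_value(x, k, source, sink):
    """Return f if the arc flows x satisfy the constraints of (2.8).

    x[i][j] is the flow on arc (i, j) and k[i][j] its capacity, with
    k[i][j] = 0 where no arc exists. Raises ValueError otherwise.
    """
    nodes = range(len(k))
    for i in nodes:
        for j in nodes:
            if not 0 <= x[i][j] <= k[i][j]:
                raise ValueError(f"capacity violated on arc ({i}, {j})")
    # Net outflow of each node: flow leaving minus flow entering.
    net_out = [sum(x[i][j] for j in nodes) - sum(x[j][i] for j in nodes)
               for i in nodes]
    for i in nodes:
        if i not in (source, sink) and net_out[i] != 0:
            raise ValueError(f"conservation violated at node {i}")
    return net_out[source]


# Hypothetical network: node 0 is the source and node 3 is the sink.
k = [[0, 3, 2, 0],
     [0, 0, 1, 2],
     [0, 0, 0, 3],
     [0, 0, 0, 0]]
x = [[0, 3, 2, 0],
     [0, 0, 1, 2],
     [0, 0, 0, 3],
     [0, 0, 0, 0]]
f = flow_value(x, k, source=0, sink=3)
print(f)
```

In this instance the two arcs leaving the source are both at capacity, so no flow of larger value exists and the flow shown is maximal.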
Example 5 (A Warehousing Problem). Consider the problem of operating a
warehouse, by buying and selling the stock of a certain commodity, in order to maximize
profit over a certain length of time. The warehouse has a fixed capacity C, and there
is a cost r per unit for holding stock for one period. The price, $p_i$, of the commodity
is known to fluctuate over a number of time periods, say months, indexed by
i. In any period the same price holds for both purchase and sale. The warehouse is
originally empty and is required to be empty at the end of the last period.

To formulate this problem, variables are introduced for each time period. In
particular, let $x_i$ denote the level of stock in the warehouse at the beginning of period i.
Let $u_i$ denote the amount bought during period i, and let $s_i$ denote the amount sold
during period i. If there are n periods, the problem is

$$
\begin{array}{ll}
\text{maximize} & \sum_{i=1}^{n} \bigl(p_i(s_i - u_i) - r x_i\bigr) \\
\text{subject to} & x_{i+1} = x_i + u_i - s_i, \quad i = 1, 2, \ldots, n - 1 \\
 & 0 = x_n + u_n - s_n \\
 & x_i + z_i = C, \quad i = 2, \ldots, n \\
 & x_1 = 0,\ x_i \ge 0,\ u_i \ge 0,\ s_i \ge 0,\ z_i \ge 0,
\end{array}
$$

where $z_i$ is a slack variable. If the constraints are written out explicitly for the case
n = 3, they take the form

$$
\begin{array}{rrrrrrrrrrl}
-u_1 & +s_1 & +x_2 & & & & & & & & = 0 \\
 & & -x_2 & -u_2 & +s_2 & & +x_3 & & & & = 0 \\
 & & x_2 & & & +z_2 & & & & & = C \\
 & & & & & & -x_3 & -u_3 & +s_3 & & = 0 \\
 & & & & & & x_3 & & & +z_3 & = C
\end{array}
$$

Note  that  the  coefficient  matrix  can  be  partitioned  into  blocks  corresponding  to 
the  variables  of  the  different  time  periods.  The  only  blocks  that  have  nonzero  entries 
are  the  diagonal  ones  and  the  ones  immediately  above  the  diagonal.  This  structure 
is  typical  of  problems  involving  time. 
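
The block structure can be seen for any number of periods by generating the constraint rows programmatically. The sketch below is an illustration under our own conventions: $x_1 = 0$ has been eliminated, the variables are ordered period by period as $u_1, s_1, x_2, u_2, s_2, z_2, \ldots$, and the capacity on the right-hand side is kept as the symbol "C" rather than a number.

```python
def warehouse_constraints(n):
    """Equality constraints of the warehousing problem, x_1 = 0 eliminated.

    Returns (names, rows, rhs). Each right-hand side is 0 or the symbol "C".
    """
    names = ["u1", "s1"]
    for i in range(2, n + 1):
        names += [f"x{i}", f"u{i}", f"s{i}", f"z{i}"]
    col = {name: c for c, name in enumerate(names)}

    def make_row(coeffs):
        r = [0] * len(names)
        for name, value in coeffs.items():
            r[col[name]] = value
        return r

    rows, rhs = [], []
    for i in range(1, n + 1):
        # Stock balance: -x_i - u_i + s_i + x_{i+1} = 0.
        terms = {f"u{i}": -1, f"s{i}": 1}
        if i >= 2:
            terms[f"x{i}"] = -1      # stock carried into period i
        if i < n:
            terms[f"x{i + 1}"] = 1   # stock carried out of period i
        rows.append(make_row(terms))
        rhs.append(0)
        if i >= 2:
            # Capacity with slack: x_i + z_i = C.
            rows.append(make_row({f"x{i}": 1, f"z{i}": 1}))
            rhs.append("C")
    return names, rows, rhs


names, rows, rhs = warehouse_constraints(3)
print(names)
for r, b in zip(rows, rhs):
    print(r, "=", b)
```

For n = 3 the rows reproduce the five constraints written out above, and the only coupling between consecutive periods is the single entry for $x_{i+1}$ in each balance row.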

Example 6 (Linear Classifier and Support Vector Machine). Suppose several
d-dimensional data points are classified into two distinct classes. For example,
two-dimensional data points may be grade averages in science and humanities for
different students. We also know the academic major of each student, as being in science
or humanities, which serves as the classification. In general we have vectors $\mathbf{a}_i \in E^d$
for $i = 1, 2, \ldots, n_1$ and vectors $\mathbf{b}_j \in E^d$ for $j = 1, 2, \ldots, n_2$. We wish to find
a hyperplane that separates the $\mathbf{a}_i$'s from the $\mathbf{b}_j$'s. Mathematically we wish to find
$\mathbf{y} \in E^d$ and a number $\beta$ such that

$$
\begin{array}{ll}
\mathbf{a}_i^T\mathbf{y} + \beta \ge 1 & \text{for all } i \\
\mathbf{b}_j^T\mathbf{y} + \beta \le -1 & \text{for all } j,
\end{array}
$$

where $\{\mathbf{x} : \mathbf{x}^T\mathbf{y} + \beta = 0\}$ is the desired hyperplane, and the separation is defined by
the +1 and -1. This is a linear program. See Fig. 2.2.
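
The two families of inequalities are simple to test for a candidate pair $(\mathbf{y}, \beta)$. The sketch below only checks whether a given hyperplane separates the data in the sense above; it does not search for one. The function name and the four sample points are hypothetical.

```python
def separates(y, beta, a_points, b_points):
    """True if a^T y + beta >= 1 for every a and b^T y + beta <= -1 for every b."""
    def score(p):
        return sum(pk * yk for pk, yk in zip(p, y)) + beta

    return (all(score(a) >= 1 for a in a_points)
            and all(score(b) <= -1 for b in b_points))


# Hypothetical two-dimensional data (science average, humanities average).
a_points = [(3, 1), (4, 2)]
b_points = [(1, 3), (0, 2)]

ok = separates((1, -1), 0, a_points, b_points)   # hyperplane x1 - x2 = 0
bad = separates((0, 1), 0, a_points, b_points)   # hyperplane x2 = 0
print(ok, bad)
```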
Example 7 (Combinatorial Auction). Suppose there are m mutually exclusive
potential states and only one of them will be true at maturity. For example, the states
may correspond to the winning horse in a race of m horses, or the value of a stock
index, falling within m intervals. An auction organizer who establishes a parimutuel
auction is prepared to issue contracts specifying subsets of the m possibilities that
pay $1 if the final state is one of those designated by the contract, and zero
otherwise. There are n participants who may place orders with the organizer for the
purchase of such contracts. An order by the jth participant consists of an m-vector
$\mathbf{a}_j = (a_{1j}, a_{2j}, \ldots, a_{mj})^T$ where each component is either 0 or 1, a one indicating a
desire to be paid if the corresponding state occurs.


Fig.  2.2  Support  vector  for  data  classification 


Accompanying the order is a number $\pi_j$ which is the price limit the participant
is willing to pay for one unit of the order. Finally, the participant also declares the
maximum number $q_j$ of units he or she is willing to accept under these terms.

The auction organizer, after receiving these various orders, must decide how
many contracts to fill. Let $x_j$ be the (real) number of units awarded to the jth
order. Then the jth participant will pay $\pi_j x_j$. The total amount paid by all participants
is $\boldsymbol{\pi}^T\mathbf{x}$, where $\mathbf{x}$ is the vector of $x_j$'s and $\boldsymbol{\pi}$ is the vector of prices.

If the outcome is the ith state, the auction organizer must pay out a total of
$\sum_{j=1}^{n} a_{ij}x_j = (\mathbf{A}\mathbf{x})_i$. The organizer would like to maximize profit in the worst
possible case, and does this by solving the problem

$$
\begin{array}{ll}
\text{maximize} & \boldsymbol{\pi}^T\mathbf{x} - \max_i\, (\mathbf{A}\mathbf{x})_i \\
\text{subject to} & \mathbf{0} \le \mathbf{x} \le \mathbf{q}.
\end{array}
$$
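This problem can be expressed alternatively as selecting $\mathbf{x}$ and scalar $s$ to

$$
\begin{array}{ll}
\text{maximize} & \boldsymbol{\pi}^T\mathbf{x} - s \\
\text{subject to} & \mathbf{A}\mathbf{x} - \mathbf{1}s \le \mathbf{0} \\
 & \mathbf{0} \le \mathbf{x} \le \mathbf{q},
\end{array}
$$

where $\mathbf{1}$ is the vector of all 1's. Notice that the profit will always be nonnegative,
since $\mathbf{x} = \mathbf{0}$ is feasible.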
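
The worst-case profit for a given award vector is straightforward to evaluate, and doing so shows why the scalar s in the second formulation equals $\max_i (\mathbf{A}\mathbf{x})_i$ at an optimum: it is the smallest s satisfying $\mathbf{A}\mathbf{x} - \mathbf{1}s \le \mathbf{0}$. The sketch below uses a hypothetical instance with m = 3 states and n = 2 orders; the function name and all numbers are ours.

```python
from fractions import Fraction


def worst_case_profit(pi, A, x):
    """Revenue pi^T x minus the largest possible payout max_i (Ax)_i."""
    revenue = sum(p * xj for p, xj in zip(pi, x))
    payouts = [sum(aij * xj for aij, xj in zip(row, x)) for row in A]
    return revenue - max(payouts)


# Order 1 pays in states 1 and 2; order 2 pays in state 3 (columns of A).
A = [[1, 0],
     [1, 0],
     [0, 1]]
pi = [Fraction(3, 5), Fraction(1, 2)]  # price limits per unit
q = [10, 10]                           # maximum units per order

both = worst_case_profit(pi, A, [10, 10])  # fill both orders completely
one = worst_case_profit(pi, A, [10, 0])    # fill only the first order
none = worst_case_profit(pi, A, [0, 0])    # fill nothing
print(both, one, none)
```

Filling only one order exposes the organizer to a loss, while filling the two complementary orders together guarantees a payout of exactly 10 in every state against a revenue of 11.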


2.3  Basic  Solutions 

Consider  the  system  of  equalities 


Ax  =  b,  (2.9) 

where $\mathbf{x}$ is an n-vector, $\mathbf{b}$ an m-vector, and $\mathbf{A}$ is an m × n matrix. Suppose that from
the n columns of $\mathbf{A}$ we select a set of m linearly independent columns (such a set
exists if the rank of $\mathbf{A}$ is m). For notational simplicity assume that we select the first
m columns of $\mathbf{A}$ and denote the m × m matrix determined by these columns by $\mathbf{B}$.
The matrix $\mathbf{B}$ is then nonsingular and we may uniquely solve the equation

$$\mathbf{B}\mathbf{x}_B = \mathbf{b} \tag{2.10}$$

for the m-vector $\mathbf{x}_B$. By putting $\mathbf{x} = (\mathbf{x}_B, \mathbf{0})$ (that is, setting the first m components
of $\mathbf{x}$ equal to those of $\mathbf{x}_B$ and the remaining components equal to zero), we obtain a
solution to $\mathbf{A}\mathbf{x} = \mathbf{b}$. This leads to the following definition.

Definition. Given the set of m simultaneous linear equations in n unknowns (2.9), let B be
any nonsingular m × m submatrix made up of columns of A. Then, if all n - m components of
x not associated with columns of B are set equal to zero, the solution to the resulting set of
equations is said to be a basic solution to (2.9) with respect to the basis B. The components
of x associated with columns of B are called basic variables.

In the above definition we refer to B as a basis, since B consists of m linearly
independent columns that can be regarded as a basis for the space $E^m$. The basic
solution corresponds to an expression for the vector b as a linear combination of
these basis vectors. This interpretation is discussed further in the next section.

In general, of course, Eq. (2.9) may have no basic solutions. However, we may
avoid trivialities and difficulties of a nonessential nature by making certain
elementary assumptions regarding the structure of the matrix A. First, we usually assume
that n > m, that is, the number of variables $x_j$ exceeds the number of equality
constraints. Second, we usually assume that the rows of A are linearly independent,
corresponding to linear independence of the m equations. A linear dependency among
the rows of A would lead either to contradictory constraints and hence no solutions
to (2.9), or to a redundancy that could be eliminated. Formally, we explicitly make
the following assumption in our development, unless noted otherwise.

Full Rank Assumption. The m × n matrix A has m < n, and the m rows of A are linearly
independent.

Under  the  above  assumption,  the  system  (2.9)  will  always  have  a  solution  and,  in 
fact,  it  will  always  have  at  least  one  basic  solution. 

The  basic  variables  in  a  basic  solution  are  not  necessarily  all  nonzero.  This  is 
noted  by  the  following  definition. 

Definition.  If  one  or  more  of  the  basic  variables  in  a  basic  solution  has  value  zero,  that 
solution  is  said  to  be  a  degenerate  basic  solution. 

We note that in a nondegenerate basic solution the basic variables, and hence the
basis B, can be immediately identified from the nonzero components of the solution.
There is ambiguity associated with a degenerate basic solution, however, since the
zero-valued basic variables and some of the nonbasic variables can be interchanged.

So far in the discussion of basic solutions we have treated only the equality
constraint (2.9) and have made no reference to nonnegativity constraints on the variables.
Similar definitions apply when these constraints are also considered. Thus, consider
now the system of constraints

$$\mathbf{A}\mathbf{x} = \mathbf{b}, \quad \mathbf{x} \ge \mathbf{0}, \tag{2.11}$$

which represent the constraints of a linear program in standard form.

Definition. A vector x satisfying (2.11) is said to be feasible for these constraints. A
feasible solution to the constraints (2.11) that is also basic is said to be a basic feasible solution;
if this solution is also a degenerate basic solution, it is called a degenerate basic feasible
solution.
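
These definitions can be explored on a small system by enumerating every choice of m columns. The sketch below is an illustration only: the helper names, the exact-arithmetic elimination routine, and the 2 × 4 example are our own, and column indices are counted from zero. For each nonsingular choice of columns it solves $\mathbf{B}\mathbf{x}_B = \mathbf{b}$, sets the remaining components to zero, and reports whether the basic solution is feasible and whether it is degenerate.

```python
from fractions import Fraction
from itertools import combinations


def solve_square(B, b):
    """Solve B z = b exactly by Gauss-Jordan elimination; None if singular."""
    m = len(B)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)]
         for row, rhs in zip(B, b)]
    for c in range(m):
        pivot = next((r for r in range(c, m) if M[r][c] != 0), None)
        if pivot is None:
            return None
        M[c], M[pivot] = M[pivot], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(m):
            if r != c:
                factor = M[r][c]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[c])]
    return [M[r][m] for r in range(m)]


def basic_solutions(A, b):
    """Yield (columns, x) for every basis made up of m columns of A."""
    m, n = len(A), len(A[0])
    for cols in combinations(range(n), m):
        B = [[A[i][j] for j in cols] for i in range(m)]
        xB = solve_square(B, b)
        if xB is None:
            continue  # the chosen columns are linearly dependent
        x = [Fraction(0)] * n
        for j, v in zip(cols, xB):
            x[j] = v
        yield cols, x


# Hypothetical system: x1 + x2 + x3 = 2,  x1 + 2*x2 + x4 = 2.
A = [[1, 1, 1, 0],
     [1, 2, 0, 1]]
b = [2, 2]

results = []
for cols, x in basic_solutions(A, b):
    feasible = all(v >= 0 for v in x)
    degenerate = any(x[j] == 0 for j in cols)
    results.append((cols, x, feasible, degenerate))
    print(cols, [str(v) for v in x], "feasible" if feasible else "infeasible",
          "degenerate" if degenerate else "nondegenerate")
```

In this instance three different bases give the same degenerate point (2, 0, 0, 0), which illustrates the ambiguity noted above: the basis cannot be read off from the nonzero components of a degenerate basic solution.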


2.4  The  Fundamental  Theorem  of  Linear  Programming 

In this section, through the fundamental theorem of linear programming, we
establish the primary importance of basic feasible solutions in solving linear programs.
The method of proof of the theorem is in many respects as important as the result
itself, since it represents the beginning of the development of the simplex method.
The theorem (due to Carathéodory) itself shows that it is necessary only to
consider basic feasible solutions when seeking an optimal solution to a linear program
because the optimal value is always achieved at such a solution.

Corresponding to a linear program in standard form

$$
\begin{array}{ll}
\text{minimize} & \mathbf{c}^T\mathbf{x} \\
\text{subject to} & \mathbf{A}\mathbf{x} = \mathbf{b}, \quad \mathbf{x} \ge \mathbf{0}
\end{array}
\tag{2.12}
$$

a feasible solution to the constraints that achieves the minimum value of the
objective function subject to those constraints is said to be an optimal feasible solution.
If this solution is basic, it is an optimal basic feasible solution.


Fundamental Theorem of Linear Programming. Given a linear program in standard form
(2.12) where A is an m × n matrix of rank m,

i) if there is a feasible solution, there is a basic feasible solution;

ii) if there is an optimal feasible solution, there is an optimal basic feasible solution.

Proof of (i). Denote the columns of A by $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n$. Suppose $\mathbf{x} = (x_1, x_2, \ldots, x_n)$
is a feasible solution. Then, in terms of the columns of A, this solution satisfies:

$$x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \cdots + x_n\mathbf{a}_n = \mathbf{b}.$$

Assume that exactly p of the variables $x_i$ are greater than zero, and for convenience,
that they are the first p variables. Thus

$$x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \cdots + x_p\mathbf{a}_p = \mathbf{b}. \tag{2.13}$$

There are now two cases, corresponding as to whether the set $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_p$ is
linearly independent or linearly dependent.

Case 1: Assume $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_p$ are linearly independent. Then clearly, $p \le m$.
If p = m, the solution is basic and the proof is complete. If p < m, then, since A
has rank m, m - p vectors can be found from the remaining n - p vectors so that
the resulting set of m vectors is linearly independent. (See Exercise 12.) Assigning
the value zero to the corresponding m - p variables yields a (degenerate) basic
feasible solution.

Case 2: Assume $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_p$ are linearly dependent. Then there is a
nontrivial linear combination of these vectors that is zero. Thus there are constants
$y_1, y_2, \ldots, y_p$, at least one of which can be assumed to be positive, such that

$$y_1\mathbf{a}_1 + y_2\mathbf{a}_2 + \cdots + y_p\mathbf{a}_p = \mathbf{0}. \tag{2.14}$$

Multiplying this equation by a scalar $s$ and subtracting it from (2.13), we obtain

$$(x_1 - s y_1)\mathbf{a}_1 + (x_2 - s y_2)\mathbf{a}_2 + \cdots + (x_p - s y_p)\mathbf{a}_p = \mathbf{b}. \tag{2.15}$$

This equation holds for every $s$, and for each $s$ the components $x_i - s y_i$ correspond to
a solution of the linear equalities, although they may violate $x_i - s y_i \ge 0$. Denoting
$\mathbf{y} = (y_1, y_2, \ldots, y_p, 0, 0, \ldots, 0)$, we see that for any $s$

$$\mathbf{x} - s\mathbf{y} \tag{2.16}$$

is a solution to the equalities. For $s = 0$, this reduces to the original feasible solution.
As $s$ is increased from zero, the various components increase, decrease, or remain
constant, depending upon whether the corresponding $y_i$ is negative, positive, or zero.
Since we assume at least one $y_i$ is positive, at least one component will decrease as $s$
is increased. We increase $s$ to the first point where one or more components become
zero. Specifically, we set

$$s = \min\{x_i / y_i : y_i > 0\}.$$

For this value of $s$ the solution given by (2.16) is feasible and has at most p - 1
positive variables. Repeating this process if necessary, we can eliminate positive
variables until we have a feasible solution with corresponding columns that are
linearly independent. At that point Case 1 applies. ∎
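
The argument of Case 2 is constructive, and it can be followed step by step in code. The sketch below is one possible rendering under stated assumptions: the helper `null_vector`, which finds a dependency among given columns by exact elimination, is our own device and not part of the proof, and the 2 × 4 system and starting point are hypothetical. At each pass the code finds $\mathbf{y}$ with $\sum y_i \mathbf{a}_i = \mathbf{0}$ on the positive components, sets $s = \min\{x_i/y_i : y_i > 0\}$, and replaces $\mathbf{x}$ by $\mathbf{x} - s\mathbf{y}$.

```python
from fractions import Fraction


def null_vector(cols):
    """Return nonzero y with sum_k y[k]*cols[k] = 0, or None if the
    vectors in cols are linearly independent."""
    p, m = len(cols), len(cols[0])
    # M is the m x p matrix whose k-th column is cols[k].
    M = [[Fraction(cols[k][i]) for k in range(p)] for i in range(m)]
    pivot_row_of = {}  # pivot column -> row holding its leading 1
    row = 0
    for c in range(p):
        r = next((r for r in range(row, m) if M[r][c] != 0), None)
        if r is None:
            # Column c depends on the earlier (pivot) columns.
            y = [Fraction(0)] * p
            y[c] = Fraction(1)
            for pc, pr in pivot_row_of.items():
                y[pc] = -M[pr][c]
            return y
        M[row], M[r] = M[r], M[row]
        piv = M[row][c]
        M[row] = [v / piv for v in M[row]]
        for rr in range(m):
            if rr != row and M[rr][c] != 0:
                f = M[rr][c]
                M[rr] = [a - f * b for a, b in zip(M[rr], M[row])]
        pivot_row_of[c] = row
        row += 1
    return None


def reduce_to_basic(A, x):
    """Shrink the set of positive components of a feasible x until the
    corresponding columns of A are linearly independent."""
    x = [Fraction(v) for v in x]
    m = len(A)
    while True:
        support = [j for j, v in enumerate(x) if v > 0]
        if not support:
            return x
        y = null_vector([[A[i][j] for i in range(m)] for j in support])
        if y is None:
            return x  # Case 1 applies
        if all(v <= 0 for v in y):
            y = [-v for v in y]  # make at least one component positive
        s = min(x[j] / v for j, v in zip(support, y) if v > 0)
        for j, v in zip(support, y):
            x[j] -= s * v


# Hypothetical system and a feasible point with four positive components.
A = [[1, 1, 1, 0],
     [1, 2, 0, 1]]
b = [2, 2]
x0 = [1, Fraction(1, 4), Fraction(3, 4), Fraction(1, 2)]
x_basic = reduce_to_basic(A, x0)
print([str(v) for v in x_basic])
```

Which basic feasible solution is reached depends on the particular dependency chosen at each pass; the proof only guarantees that some basic feasible solution is obtained after finitely many reductions.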

Proof of (ii). Let $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ be an optimal feasible solution and, as in
the proof of (i) above, suppose there are exactly p positive variables $x_1, x_2, \ldots, x_p$.
Again there are two cases; and Case 1, corresponding to linear independence, is
exactly the same as before.

Case 2 also goes exactly the same as before, but it must be shown that for any
$s$ the solution (2.16) is optimal. To show this, note that the value of the solution
$\mathbf{x} - s\mathbf{y}$ is

$$\mathbf{c}^T\mathbf{x} - s\,\mathbf{c}^T\mathbf{y}. \tag{2.17}$$

For $s$ sufficiently small in magnitude, $\mathbf{x} - s\mathbf{y}$ is a feasible solution for positive or
negative values of $s$. Thus we conclude that $\mathbf{c}^T\mathbf{y} = 0$. For, if $\mathbf{c}^T\mathbf{y} \ne 0$, an $s$ of small
magnitude and proper sign could be determined so as to render (2.17) smaller than
$\mathbf{c}^T\mathbf{x}$ while maintaining feasibility. This would violate the assumption of optimality
of $\mathbf{x}$ and hence we must have $\mathbf{c}^T\mathbf{y} = 0$.

Having established that the new feasible solution with fewer positive components
is also optimal, the remainder of the proof may be completed exactly as in part (i). ∎


This theorem reduces the task of solving a linear program to that of searching
over basic feasible solutions. Since for a problem having n variables and m
constraints there are at most

$$\binom{n}{m} = \frac{n!}{m!\,(n - m)!}$$

basic solutions (corresponding to the number of ways of selecting m of n columns),
there are only a finite number of possibilities. Thus the fundamental theorem yields
an obvious, but terribly inefficient, finite search technique. By expanding upon the
technique of proof as well as the statement of the fundamental theorem, the efficient
simplex procedure is derived.
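
A quick calculation shows how rapidly this bound grows and why exhaustive search is impractical even for modest problems. The problem sizes below are arbitrary choices for illustration.

```python
from math import comb

# Upper bound on the number of basic solutions: choose m of the n columns.
sizes = [(4, 2), (20, 10), (50, 25)]
counts = [comb(n, m) for n, m in sizes]
for (n, m), c in zip(sizes, counts):
    print(f"n = {n:2d}, m = {m:2d}: at most {c} bases")
```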

It  should  be  noted  that  the  proof  of  the  fundamental  theorem  given  above  is  of 
a  simple  algebraic  character.  In  the  next  section  the  geometric  interpretation  of  this 
theorem  is  explored  in  terms  of  the  general  theory  of  convex  sets.  Although  the 
geometric  interpretation  is  aesthetically  pleasing  and  theoretically  important,  the 
reader  should  bear  in  mind,  lest  one  be  diverted  by  the  somewhat  more  advanced 
arguments  employed,  the  underlying  elementary  level  of  the  fundamental  theorem. 


2.5 Relations to Convexity

Our development to this point, including the above proof of the fundamental
theorem, has been based only on elementary properties of systems of linear equations.
These results, however, have interesting interpretations in terms of the theory of
convex sets that can lead not only to an alternative derivation of the fundamental
theorem, but also to a clearer geometric understanding of the result. The main
link between the algebraic and geometric theories is the formal relation between
basic feasible solutions of linear inequalities in standard form and extreme points
of polytopes. We establish this correspondence as follows. The reader is referred to
Appendix B for a more complete summary of concepts related to convexity, but the
definition of an extreme point is stated here.

Definition. A point x in a convex set C is said to be an extreme point of C if there are no
two distinct points $\mathbf{x}_1$ and $\mathbf{x}_2$ in C such that $\mathbf{x} = \alpha\mathbf{x}_1 + (1 - \alpha)\mathbf{x}_2$ for some $\alpha$, $0 < \alpha < 1$.

An  extreme  point  is  thus  a  point  that  does  not  lie  strictly  within  a  line  segment 
connecting  two  other  points  of  the  set.  The  extreme  points  of  a  triangle,  for  example, 
are  its  three  vertices. 

Theorem (Equivalence of Extreme Points and Basic Solutions). Let A be an m × n matrix
of rank m and b an m-vector. Let K be the convex polytope consisting of all n-vectors x
satisfying

$$\mathbf{A}\mathbf{x} = \mathbf{b}, \quad \mathbf{x} \ge \mathbf{0}. \tag{2.18}$$

A vector x is an extreme point of K if and only if x is a basic feasible solution to (2.18).

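Proof. Suppose first that $\mathbf{x} = (x_1, x_2, \ldots, x_m, 0, 0, \ldots, 0)$ is a basic feasible
solution to (2.18). Then

$$x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \cdots + x_m\mathbf{a}_m = \mathbf{b},$$

where $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_m$, the first m columns of A, are linearly independent. Suppose
that $\mathbf{x}$ could be expressed as a convex combination of two other points in K; say,
$\mathbf{x} = \alpha\mathbf{y} + (1 - \alpha)\mathbf{z}$, $0 < \alpha < 1$, $\mathbf{y} \ne \mathbf{z}$. Since all components of $\mathbf{x}$, $\mathbf{y}$, $\mathbf{z}$ are nonnegative
and since $0 < \alpha < 1$, it follows immediately that the last n - m components of $\mathbf{y}$ and
$\mathbf{z}$ are zero. Thus, in particular, we have

$$y_1\mathbf{a}_1 + y_2\mathbf{a}_2 + \cdots + y_m\mathbf{a}_m = \mathbf{b}$$

and

$$z_1\mathbf{a}_1 + z_2\mathbf{a}_2 + \cdots + z_m\mathbf{a}_m = \mathbf{b}.$$

Since the vectors $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_m$ are linearly independent, however, it follows that
$\mathbf{x} = \mathbf{y} = \mathbf{z}$ and hence $\mathbf{x}$ is an extreme point of K.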

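Conversely, assume that $\mathbf{x}$ is an extreme point of $K$. Let us assume that the
nonzero components of $\mathbf{x}$ are the first $k$ components. Then

    $x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \cdots + x_k\mathbf{a}_k = \mathbf{b},$

with $x_i > 0$, $i = 1, 2, \ldots, k$. To show that $\mathbf{x}$ is a basic feasible solution it must be
shown that the vectors $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_k$ are linearly independent. We do this by
contradiction. Suppose $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_k$ are linearly dependent. Then there is a nontrivial
linear combination that is zero:

    $y_1\mathbf{a}_1 + y_2\mathbf{a}_2 + \cdots + y_k\mathbf{a}_k = \mathbf{0}.$

Define the $n$-vector $\mathbf{y} = (y_1, y_2, \ldots, y_k, 0, 0, \ldots, 0)$. Since $x_i > 0$, $1 \leq i \leq k$, it is
possible to select $\varepsilon > 0$ such that

    $\mathbf{x} + \varepsilon\mathbf{y} \geq \mathbf{0}, \quad \mathbf{x} - \varepsilon\mathbf{y} \geq \mathbf{0}.$

Because $\mathbf{A}\mathbf{y} = \mathbf{0}$, both of these vectors also satisfy $\mathbf{A}\mathbf{x} = \mathbf{b}$ and therefore belong to $K$.
We then have $\mathbf{x} = \tfrac{1}{2}(\mathbf{x} + \varepsilon\mathbf{y}) + \tfrac{1}{2}(\mathbf{x} - \varepsilon\mathbf{y})$, which expresses $\mathbf{x}$ as a convex combination of
two distinct vectors in $K$. This cannot occur, since $\mathbf{x}$ is an extreme point of $K$. Thus
$\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_k$ are linearly independent and $\mathbf{x}$ is a basic feasible solution. (Although
if $k < m$, it is a degenerate basic feasible solution.) ∎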
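The construction used in the second half of the proof can be tried on a small case. The sketch below is illustrative only: the constraint $x_1 + x_2 + x_3 = 1$ (so $m = 1$), the feasible point $(1/3, 1/3, 1/3)$ and the direction $\mathbf{y} = (1, -1, 0)$ are all choices made here, not data from the text. The point has $k = 3 > m$ positive components, so its columns are dependent and it should be the midpoint of two distinct feasible points.

```python
from fractions import Fraction as F

A = [[1, 1, 1]]                      # one equation, so m = 1 (assumed example)
b = [1]
x = [F(1, 3), F(1, 3), F(1, 3)]      # feasible, with k = 3 positive components
y = [F(1), F(-1), F(0)]              # hand-picked: 1*a1 - 1*a2 + 0*a3 = 0


def apply(matrix, v):
    """Matrix-vector product with exact fractions."""
    return [sum(a * vi for a, vi in zip(row, v)) for row in matrix]


# Largest step that keeps both x + eps*y and x - eps*y nonnegative.
eps = min(xi / abs(yi) for xi, yi in zip(x, y) if yi != 0)
plus = [xi + eps * yi for xi, yi in zip(x, y)]
minus = [xi - eps * yi for xi, yi in zip(x, y)]
midpoint = [(p + q) / 2 for p, q in zip(plus, minus)]

print("eps =", eps)
print("x + eps*y =", [str(v) for v in plus])
print("x - eps*y =", [str(v) for v in minus])
```

Both endpoints satisfy the equality and the sign constraints, and their midpoint is the original point, so this point cannot be extreme.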

This correspondence between extreme points and basic feasible solutions enables
us to prove certain geometric properties of the convex polytope $K$ defining the
constraint set of a linear programming problem.

Corollary 1. If the convex set $K$ corresponding to (2.18) is nonempty, it has at least one
extreme point.

Proof. This follows from the first part of the Fundamental Theorem and the
Equivalence Theorem above. ∎

Corollary  2.  If  there  is  a  finite  optimal  solution  to  a  linear  programming  problem,  there  is 
a  finite  optimal  solution  which  is  an  extreme  point  of  the  constraint  set. 

Corollary  3.  The  constraint  set  K  corresponding  to  (2.18)  possesses  at  most  a  finite  number 
of  extreme  points. 

Proof. There are obviously only a finite number of basic solutions obtained by
selecting $m$ basis vectors from the $n$ columns of $\mathbf{A}$. The extreme points of $K$ are
a subset of these basic solutions. ∎

Finally,  we  come  to  the  special  case  which  occurs  most  frequently  in  practice  and 
which  in  some  sense  is  characteristic  of  well-formulated  linear  programs — the  case 
where  the  constraint  set  K  is  nonempty  and  bounded.  In  this  case  we  combine  the 
results  of  the  Equivalence  Theorem  and  Corollary  3  above  to  obtain  the  following 
corollary. 

Corollary 4. If the convex polytope $K$ corresponding to (2.18) is bounded, then $K$ is a
convex polyhedron, that is, $K$ consists of points that are convex combinations of a finite number
of points.

Some  of  these  results  are  illustrated  by  the  following  examples: 


Example 1. Consider the constraint set in $E^3$ defined by

    $x_1 + x_2 + x_3 = 1$
    $x_1 \geq 0, \quad x_2 \geq 0, \quad x_3 \geq 0.$

This set is illustrated in Fig. 2.3. It has three extreme points, corresponding to the
three basic solutions to $x_1 + x_2 + x_3 = 1$.

Fig. 2.3 Feasible set for Example 1

Example 2. Consider the constraint set in $E^3$ defined by

    $x_1 + x_2 + x_3 = 1$
    $2x_1 + 3x_2 = 1$
    $x_1 \geq 0, \quad x_2 \geq 0, \quad x_3 \geq 0.$

This set is illustrated in Fig. 2.4. It has two extreme points, corresponding to the
two basic feasible solutions. Note that the system of equations itself has three basic
solutions, (2, -1, 0), (1/2, 0, 1/2), (0, 1/3, 2/3), the first of which is not feasible.
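The three basic solutions quoted for Example 2 can be reproduced by brute force: choose every pair of columns, solve the resulting square system, and check the signs. The following is a minimal sketch rather than a recommended method; the helper names `solve_square` and `basic_solutions` are invented here, and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction
from itertools import combinations


def solve_square(M, rhs):
    """Solve M z = rhs by Gauss-Jordan elimination over exact fractions.

    Returns None when M is singular (the chosen columns are dependent).
    """
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        piv = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if piv is None:
            return None
        aug[col], aug[piv] = aug[piv], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * c for a, c in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def basic_solutions(A, b):
    """Return (basis_columns, x) for every choice of m independent columns."""
    m, n = len(A), len(A[0])
    found = []
    for cols in combinations(range(n), m):
        B = [[A[i][j] for j in cols] for i in range(m)]
        z = solve_square(B, b)
        if z is None:
            continue
        x = [Fraction(0)] * n
        for j, v in zip(cols, z):
            x[j] = v
        found.append((cols, x))
    return found


# Data of Example 2: x1 + x2 + x3 = 1, 2*x1 + 3*x2 = 1.
A = [[1, 1, 1], [2, 3, 0]]
b = [1, 1]
solutions = basic_solutions(A, b)
feasible = [x for _, x in solutions if all(v >= 0 for v in x)]
for cols, x in solutions:
    tag = "feasible" if all(v >= 0 for v in x) else "infeasible"
    print(cols, [str(v) for v in x], tag)
```

Enumeration of this kind grows combinatorially with the number of columns, so it is useful only for checking small examples by hand.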

Example 3. Consider the constraint set in $E^2$ defined in terms of the inequalities

    $x_1 + \tfrac{8}{3}x_2 \leq 4$
    $x_1 + x_2 \leq 2$
    $2x_1 \leq 3$
    $x_1 \geq 0, \quad x_2 \geq 0.$

This set is illustrated in Fig. 2.5. We see by inspection that this set has five
extreme points. In order to compare this example with our general results we must
introduce slack variables to yield the equivalent set in $E^5$:

    $x_1 + \tfrac{8}{3}x_2 + x_3 = 4$
    $x_1 + x_2 + x_4 = 2$
    $2x_1 + x_5 = 3$
    $x_1 \geq 0, \quad x_2 \geq 0, \quad x_3 \geq 0, \quad x_4 \geq 0, \quad x_5 \geq 0.$

A basic solution for this system is obtained by setting any two variables to zero and
solving for the remaining three. As indicated in Fig. 2.5, each edge of the figure
corresponds to one variable being zero, and the extreme points are the points where
two variables are zero.


Fig.  2.4  Feasible  set  for  Example  2 

The last example illustrates that even when not expressed in standard form the
extreme points of the set defined by the constraints of a linear program correspond to
the possible solution points. This can be illustrated more directly by including the
objective function in the figure as well. Suppose, for example, that in Example 3
the objective function to be minimized is $-2x_1 - x_2$. The set of points satisfying
$-2x_1 - x_2 = z$ for fixed $z$ is a line. As $z$ varies, different parallel lines are obtained
as shown in Fig. 2.6. The optimal value of the linear program is the smallest value
of $z$ for which the corresponding line has a point in common with the feasible set.
It should be reasonably clear, at least in two dimensions, that the points of solution
will always include an extreme point. In the figure this occurs at the point (3/2, 1/2)
with $z = -7/2$.
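The geometric argument can be checked numerically by working directly with the inequalities in $E^2$: intersect every pair of boundary lines, keep the intersections that satisfy all constraints, and evaluate the objective at each. This is a sketch under stated assumptions, with the first constraint taken as $x_1 + \tfrac{8}{3}x_2 \leq 4$ and with helper names (`intersection`, `is_feasible`, `objective`) introduced only for this illustration.

```python
from fractions import Fraction as F
from itertools import combinations

# Inequalities g[0]*x1 + g[1]*x2 <= h.  The 8/3 coefficient in the first row is
# an assumption about Example 3; nonnegativity is written as -x1 <= 0, -x2 <= 0.
G = [(F(1), F(8, 3)), (F(1), F(1)), (F(2), F(0)), (F(-1), F(0)), (F(0), F(-1))]
H = [F(4), F(2), F(3), F(0), F(0)]


def intersection(i, j):
    """Point where boundary lines i and j meet, or None if they are parallel."""
    (a, b), (c, d) = G[i], G[j]
    det = a * d - b * c
    if det == 0:
        return None
    # Cramer's rule for the 2 x 2 system.
    return ((H[i] * d - b * H[j]) / det, (a * H[j] - H[i] * c) / det)


def is_feasible(pt):
    """True when the point satisfies every inequality."""
    return all(g[0] * pt[0] + g[1] * pt[1] <= h for g, h in zip(G, H))


vertices = set()
for i, j in combinations(range(len(G)), 2):
    pt = intersection(i, j)
    if pt is not None and is_feasible(pt):
        vertices.add(pt)


def objective(pt):
    """Objective of the text: minimize -2*x1 - x2."""
    return -2 * pt[0] - pt[1]


best = min(vertices, key=objective)
print(sorted((str(p[0]), str(p[1])) for p in vertices))
print("minimizer:", best, "value:", objective(best))
```

Under these assumptions the search finds five vertices and the minimum at (3/2, 1/2) with value -7/2, in agreement with the figure described in the text.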


2.6  Exercises 



1. Convert the following problems to standard form:

(a) minimize    $x + 2y + 3z$
    subject to  $2 \leq x + y \leq 3$
                $4 \leq x + z \leq 5$
                $x \geq 0, \quad y \geq 0, \quad z \geq 0.$

(b) minimize    $x + y + z$
    subject to  $x + 2y + 3z = 10$
                $x \geq 1, \quad y \geq 2, \quad z \geq 1.$


Fig. 2.5 Feasible set for Example 3


2. A manufacturer wishes to produce an alloy that is, by weight, 30 % metal A and
70 % metal B. Five alloys are available at various prices as indicated below:

    Alloy        1      2      3      4      5
    %A          10     25     50     75     95
    %B          90     75     50     25      5
    Price/lb    $5     $4     $3     $2     $1.50

The desired alloy will be produced by combining some of the other alloys. The
manufacturer wishes to find the amounts of the various alloys needed and to
determine the least expensive combination. Formulate this problem as a linear
program.

Fig. 2.6 Illustration of extreme point solution


3. An oil refinery has two sources of crude oil: a light crude that costs $35/barrel
and a heavy crude that costs $30/barrel. The refinery produces gasoline, heating
oil, and jet fuel from crude in the amounts per barrel indicated in the following
table:

                   Gasoline    Heating oil    Jet fuel
    Light crude      0.3          0.2           0.3
    Heavy crude      0.3          0.4           0.2

The refinery has contracted to supply 900,000 barrels of gasoline, 800,000 barrels
of heating oil, and 500,000 barrels of jet fuel. The refinery wishes to find
the amounts of light and heavy crude to purchase so as to be able to meet its
obligations at minimum cost. Formulate this problem as a linear program.

4. A small firm specializes in making five types of spare automobile parts. Each
part is first cast from iron in the casting shop and then sent to the finishing shop
where holes are drilled, surfaces are turned, and edges are ground. The required
worker-hours (per 100 units) for each of the parts in the two shops are shown
below:

    Part         1    2    3    4    5
    Casting      2    1    3    3    1
    Finishing    3    2    2    1    1

The profits from the parts are $30, $20, $40, $25, and $10 (per 100 units),
respectively. The capacities of the casting and finishing shops over the next
month are 700 and 1,000 worker-hours, respectively. Formulate the problem of
determining the quantities of each spare part to be made during the month so as
to maximize profit.

5. Convert the following problem to standard form and solve:

    maximize    $x_1 + 4x_2 + x_3$
    subject to  $2x_1 - 2x_2 + x_3 = 4$
                $x_1 - x_3 = 1$
                $x_2 \geq 0, \quad x_3 \geq 0.$


6. A large textile firm has two manufacturing plants, two sources of raw material,
and three market centers. The transportation costs between the sources and the
plants and between the plants and the markets are as follows:

                    Plant
                    A            B
    Source 1        $1/ton       $1.50/ton
    Source 2        $2/ton       $1.50/ton

                    Market
                    1            2            3
    Plant A         $4/ton       $2/ton       $1/ton
    Plant B         $3/ton       $4/ton       $2/ton

Ten tons are available from source 1 and 15 tons from source 2. The three market
centers require 8 tons, 14 tons, and 3 tons. The plants have unlimited processing
capacity.

(a) Formulate the problem of finding the shipping patterns from sources to
plants to markets that minimize the total transportation cost.

(b) Reduce the problem to a single standard transportation problem with two
sources and three destinations. (Hint: Find minimum cost paths from sources
to markets.)

(c) Suppose that plant A has a processing capacity of 8 tons, and plant B has
a processing capacity of 7 tons. Show how to reduce the problem to two
separate standard transportation problems.

7. A businessman is considering an investment project. The project has a lifetime
of 4 years, with cash flows of -$100,000, +$50,000, +$70,000, and +$30,000
in each of the 4 years, respectively. At any time he may borrow funds at the
rates of 12 %, 22 %, and 34 % (total) for 1, 2, or 3 periods, respectively. He may
loan funds at 10 % per period. He calculates the present value of a project as
the maximum amount of money he would pay now, to another party, for the
project, assuming that he has no cash on hand and must borrow and lend to pay
the other party and operate the project while maintaining a nonnegative cash
balance after all debts are paid. Formulate the project valuation problem in a
linear programming framework.

8. Convert the following problem to a linear program in standard form:

    minimize    $|x| + |y| + |z|$
    subject to  $x + y \leq 1$
                $2x + z = 3.$

9. A class of piecewise linear functions can be represented as $f(\mathbf{x}) = \text{Maximum}\,
(\mathbf{c}_1^T\mathbf{x} + d_1,\ \mathbf{c}_2^T\mathbf{x} + d_2,\ \ldots,\ \mathbf{c}_p^T\mathbf{x} + d_p)$. For such a function $f$, consider the problem

    minimize    $f(\mathbf{x})$
    subject to  $\mathbf{A}\mathbf{x} = \mathbf{b}, \quad \mathbf{x} \geq \mathbf{0}.$

Show how to convert this problem to a linear programming problem.

10. A small computer manufacturing company forecasts the demand over the next
$n$ months to be $d_i$, $i = 1, 2, \ldots, n$. In any month it can produce $r$ units, using
regular production, at a cost of $b$ dollars per unit. By using overtime, it can
produce additional units at $c$ dollars per unit, where $c > b$. The firm can store
units from month to month at a cost of $s$ dollars per unit per month. Formulate
the problem of determining the production schedule that minimizes cost. (Hint:
See Exercise 9.)

11. Discuss the situation of a linear program that has one or more columns of the $\mathbf{A}$
matrix equal to zero. Consider both the case where the corresponding variables
are required to be nonnegative and the case where some are free.

12. Suppose that the matrix $\mathbf{A} = (\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n)$ has rank $m$, and that for some
$p < m$, $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_p$ are linearly independent. Show that $m - p$ vectors
from the remaining $n - p$ vectors can be adjoined to form a set of $m$ linearly
independent vectors.

13. Suppose that $\mathbf{x}$ is a feasible solution to the linear program (2.12), with $\mathbf{A}$ an
$m \times n$ matrix of rank $m$. Show that there is a feasible solution $\mathbf{y}$ having the same
value (that is, $\mathbf{c}^T\mathbf{y} = \mathbf{c}^T\mathbf{x}$) and having at most $m + 1$ positive components.

14. What are the basic solutions of Example 3, Sect. 2.5?

15. Let $S$ be a convex set in $E^n$ and $S^*$ a convex set in $E^m$. Suppose $\mathbf{T}$ is an $m \times n$
matrix that establishes a one-to-one correspondence between $S$ and $S^*$, i.e., for
every $\mathbf{s} \in S$ there is $\mathbf{s}^* \in S^*$ such that $\mathbf{T}\mathbf{s} = \mathbf{s}^*$, and for every $\mathbf{s}^* \in S^*$ there is a
single $\mathbf{s} \in S$ such that $\mathbf{T}\mathbf{s} = \mathbf{s}^*$. Show that there is a one-to-one correspondence
between extreme points of $S$ and $S^*$.

16. Consider the two linear programming problems in Example 1, Sect. 2.1, one
in $E^n$ and the other in $E^{n+m}$. Show that there is a one-to-one correspondence
between extreme points of these two problems.

2.1-2.4 The approach taken in this chapter, which is continued in the next, is the
more or less standard approach to linear programming as presented in, for
example, Dantzig [D6], Hadley [H1], Gass [G4], Simonnard [S6], Murty
[M11], and Gale [G2]. Also see Bazaraa, Jarvis, and H. F. Sherali [B6],
Bertsimas and Tsitsiklis [B13], Cottle [C6], Dantzig and Thapa [D9, D10],
Nash and Sofer [N1], Saigal [S1], and Vanderbei [V3].

2.5 An excellent discussion of this type can be found in Simonnard [S6].


Chapter  3 

The  Simplex  Method 


The  idea  of  the  simplex  method  is  to  proceed  from  one  basic  feasible  solution  (that 
is,  one  extreme  point)  of  the  constraint  set  of  a  problem  in  standard  form  to  another, 
in  such  a  way  as  to  continually  decrease  the  value  of  the  objective  function  until  a 
minimum  is  reached.  The  results  of  Chap.  2  assure  us  that  it  is  sufficient  to  consider 
only  basic  feasible  solutions  in  our  search  for  an  optimal  feasible  solution.  This 
chapter  demonstrates  that  an  efficient  method  for  moving  among  basic  solutions  to 
the  minimum  can  be  constructed. 

In the first five sections of this chapter the simplex machinery is developed from
a careful examination of the system of linear equations that defines the constraints
and the basic feasible solutions of the system. This approach, which focuses on
individual variables and their relation to the system, is probably the simplest, but
unfortunately is not easily expressed in compact form. In the last few sections of
the chapter, the simplex method is viewed from a matrix theoretic approach, which
focuses on all variables together. This more sophisticated viewpoint leads to a
compact notational representation, increased insight into the simplex process, and to
alternative methods for implementation.


3.1  Pivots 

To obtain a firm grasp of the simplex procedure, it is essential that one first
understand the process of pivoting in a set of simultaneous linear equations. There are two
dual interpretations of the pivot procedure.


First  Interpretation 

Consider the set of simultaneous linear equations

    $a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n = b_1$
    $a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n = b_2$
    $\vdots$                                                  (3.1)
    $a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n = b_m,$

where $m \leq n$. In matrix form we write this as

    $\mathbf{A}\mathbf{x} = \mathbf{b}.$    (3.2)

In the space $E^n$ we interpret this as a collection of $m$ linear relations that must be
satisfied by a vector $\mathbf{x}$. Thus denoting by $\mathbf{a}^i$ the $i$th row of $\mathbf{A}$ we may express (3.1) as:

    $\mathbf{a}^1\mathbf{x} = b_1$
    $\mathbf{a}^2\mathbf{x} = b_2$
    $\vdots$                 (3.3)
    $\mathbf{a}^m\mathbf{x} = b_m.$

This corresponds to the most natural interpretation of (3.1) as a set of $m$ equations.

If $m < n$ and the equations are linearly independent, then there is not a unique
solution but a whole linear variety of solutions (see Appendix B). A unique solution
results, however, if $n - m$ additional independent linear equations are adjoined. For
example, we might specify $n - m$ equations of the form $\mathbf{e}_k\mathbf{x} = 0$, where $\mathbf{e}_k$ is the $k$th
unit vector (which is equivalent to $x_k = 0$), in which case we obtain a basic solution
to (3.1). Different basic solutions are obtained by imposing different additional
equations of this special form.

If the equations (3.3) are linearly independent, we may replace a given equation by any
nonzero multiple of itself plus any linear combination of the other equations in the
system. This leads to the well-known Gaussian reduction schemes, whereby multiples
of equations are systematically subtracted from one another to yield either a
triangular or canonical form. It is well known, and easily proved, that if the first $m$
columns of $\mathbf{A}$ are linearly independent, the system (3.1) can, by a sequence of such
multiplications and subtractions, be converted to the following canonical form:

    $x_1 + \bar a_{1(m+1)}x_{m+1} + \bar a_{1(m+2)}x_{m+2} + \cdots + \bar a_{1n}x_n = \bar a_{10}$
    $x_2 + \bar a_{2(m+1)}x_{m+1} + \bar a_{2(m+2)}x_{m+2} + \cdots + \bar a_{2n}x_n = \bar a_{20}$
    $\vdots$                                                                    (3.4)
    $x_m + \bar a_{m(m+1)}x_{m+1} + \bar a_{m(m+2)}x_{m+2} + \cdots + \bar a_{mn}x_n = \bar a_{m0}.$

Corresponding to this canonical representation of the system, the variables $x_1$,
$x_2, \ldots, x_m$ are called basic and the other variables are nonbasic. The corresponding
basic solution is then:

    $x_1 = \bar a_{10}, \quad x_2 = \bar a_{20}, \quad \ldots, \quad x_m = \bar a_{m0}, \quad x_{m+1} = 0, \quad \ldots, \quad x_n = 0,$

or in vector form: $\mathbf{x} = (\bar{\mathbf{a}}_0, \mathbf{0})$ where $\bar{\mathbf{a}}_0$ is $m$-dimensional and $\mathbf{0}$ is the $(n - m)$-dimensional
zero vector.

Actually, we relax our definition somewhat and consider a system to be in canonical
form if, among the $n$ variables, there are $m$ basic ones with the property that each
appears in only one equation, its coefficient in that equation is unity, and no two of
these $m$ variables appear in any one equation. This is equivalent to saying that a
system is in canonical form if by some reordering of the equations and the variables
it takes the form (3.4).

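Also it is customary, from the dictates of economy, to represent the system (3.4)
by its corresponding array of coefficients or tableau:

$$
\begin{array}{ccccccccc|c}
x_1 & x_2 & x_3 & \cdots & x_m & x_{m+1} & x_{m+2} & \cdots & x_n & \\
\hline
1 & 0 & 0 & \cdots & 0 & \bar a_{1(m+1)} & \bar a_{1(m+2)} & \cdots & \bar a_{1n} & \bar a_{10} \\
0 & 1 & 0 & \cdots & 0 & \bar a_{2(m+1)} & \bar a_{2(m+2)} & \cdots & \bar a_{2n} & \bar a_{20} \\
0 & 0 & 1 & \cdots & 0 & \cdot & \cdot & \cdots & \cdot & \cdot \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & \bar a_{m(m+1)} & \bar a_{m(m+2)} & \cdots & \bar a_{mn} & \bar a_{m0}
\end{array}
\qquad (3.5)
$$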


The question solved by pivoting is this: given a system in canonical form, suppose
a basic variable is to be made nonbasic and a nonbasic variable is to be made basic;
what is the new canonical form corresponding to the new set of basic variables? The
procedure is quite simple. Suppose in the canonical system (3.4) we wish to replace
the basic variable $x_p$, $1 \leq p \leq m$, by the nonbasic variable $x_q$. This can be done if
and only if $\bar a_{pq}$ is nonzero; it is accomplished by dividing row $p$ by $\bar a_{pq}$ to get a unit
coefficient for $x_q$ in the $p$th equation, and then subtracting suitable multiples of row
$p$ from each of the other rows in order to get a zero coefficient for $x_q$ in all other
equations. This transforms the $q$th column of the tableau so that it is zero except
in its $p$th entry (which is unity) and does not affect the columns of the other basic
variables. Denoting the coefficients of the new system in canonical form by $\bar a'_{ij}$ we
have explicitly

    $\bar a'_{ij} = \bar a_{ij} - \dfrac{\bar a_{iq}}{\bar a_{pq}}\,\bar a_{pj}, \quad i \neq p$
                                                                (3.6)
    $\bar a'_{pj} = \dfrac{\bar a_{pj}}{\bar a_{pq}}.$


Equations (3.6) are the pivot equations that arise frequently in linear programming.
The element $\bar a_{pq}$ in the original system is said to be the pivot element.

Example 1. Consider the system in canonical form:

    $x_1 + x_4 + x_5 - x_6 = 5$
    $x_2 + 2x_4 - 3x_5 + x_6 = 3$
    $x_3 - x_4 + 2x_5 - x_6 = -1.$

Let us find the basic solution having basic variables $x_4$, $x_5$, $x_6$. We set up the
coefficient array below:

      x1      x2      x3      x4      x5      x6
       1       0       0     [1]       1      -1       5
       0       1       0       2      -3       1       3
       0       0       1      -1       2      -1      -1

The bracketed entry is our first pivot element and corresponds to the replacement of
$x_1$ by $x_4$ as a basic variable. After pivoting we obtain the array

      x1      x2      x3      x4      x5      x6
       1       0       0       1       1      -1       5
      -2       1       0       0    [-5]       3      -7
       1       0       1       0       3      -2       4

and again we have bracketed the next pivot element, indicating our intention to replace
$x_2$ by $x_5$. We then obtain

      x1      x2      x3      x4      x5      x6
     3/5     1/5       0       1       0    -2/5    18/5
     2/5    -1/5       0       0       1    -3/5     7/5
    -1/5     3/5       1       0       0  [-1/5]    -1/5

Continuing, there results

      x1      x2      x3      x4      x5      x6
       1      -1      -2       1       0       0       4
       1      -2      -3       0       1       0       2
       1      -3      -5       0       0       1       1

From this last canonical form we obtain the new basic solution

    $x_4 = 4, \quad x_5 = 2, \quad x_6 = 1.$
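The three pivots of Example 1 can be replayed mechanically. The following is a minimal sketch, not part of the original development: the function name `pivot` and the use of exact fractions are choices made here for illustration, and rows and columns are indexed from zero rather than from one.

```python
from fractions import Fraction


def pivot(tableau, p, q):
    """Return a new tableau after pivoting on row p, column q (0-based).

    The pivot row is divided by the pivot element, and a multiple of the
    resulting row is subtracted from every other row so that column q
    becomes a unit vector, as in the pivot equations (3.6).
    """
    pivot_element = Fraction(tableau[p][q])
    if pivot_element == 0:
        raise ValueError("pivot element must be nonzero")
    pivot_row = [Fraction(v) / pivot_element for v in tableau[p]]
    result = []
    for i, row in enumerate(tableau):
        if i == p:
            result.append(pivot_row)
        else:
            factor = row[q]
            result.append([Fraction(v) - factor * w for v, w in zip(row, pivot_row)])
    return result


# Coefficient array of Example 1; the last column is the right-hand side.
t0 = [
    [1, 0, 0, 1, 1, -1, 5],
    [0, 1, 0, 2, -3, 1, 3],
    [0, 0, 1, -1, 2, -1, -1],
]
t1 = pivot(t0, 0, 3)   # x4 replaces x1
t2 = pivot(t1, 1, 4)   # x5 replaces x2
t3 = pivot(t2, 2, 5)   # x6 replaces x3

for row in t3:
    print([str(v) for v in row])
```

Returning a new tableau instead of modifying the old one keeps each intermediate array available for comparison with the arrays printed in the example.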


Second  Interpretation 

The set of simultaneous equations represented by (3.1) and (3.2) can be interpreted
in $E^m$ as a vector equation. Denoting the columns of $\mathbf{A}$ by $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n$ we write
(3.1) as

    $x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \cdots + x_n\mathbf{a}_n = \mathbf{b}.$    (3.7)

In this interpretation we seek to express $\mathbf{b}$ as a linear combination of the $\mathbf{a}_j$'s.

If $m < n$ and the vectors $\mathbf{a}_j$ span $E^m$ then there is not a unique solution but a
whole family of solutions. The vector $\mathbf{b}$ has a unique representation, however, as
a linear combination of a given linearly independent subset of these vectors. The
corresponding solution with $(n - m)$ $x_j$ variables set equal to zero is a basic solution
to (3.1).

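Suppose now that we start again with a system in the canonical form corresponding
to the tableau:

$$
\begin{array}{ccccccccc|c}
\mathbf{a}_1 & \mathbf{a}_2 & \mathbf{a}_3 & \cdots & \mathbf{a}_m & \mathbf{a}_{m+1} & \mathbf{a}_{m+2} & \cdots & \mathbf{a}_n & \mathbf{b} \\
\hline
1 & 0 & 0 & \cdots & 0 & \bar a_{1(m+1)} & \bar a_{1(m+2)} & \cdots & \bar a_{1n} & \bar a_{10} \\
0 & 1 & 0 & \cdots & 0 & \bar a_{2(m+1)} & \bar a_{2(m+2)} & \cdots & \bar a_{2n} & \bar a_{20} \\
0 & 0 & 1 & \cdots & 0 & \cdot & \cdot & \cdots & \cdot & \cdot \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & \bar a_{m(m+1)} & \bar a_{m(m+2)} & \cdots & \bar a_{mn} & \bar a_{m0}
\end{array}
\qquad (3.8)
$$

In this case the first $m$ vectors form a basis. Furthermore, every other vector
represented in the tableau can be expressed as a linear combination of these basis vectors
by simply reading the coefficients down the corresponding column. Thus

    $\mathbf{a}_j = \bar a_{1j}\mathbf{a}_1 + \bar a_{2j}\mathbf{a}_2 + \cdots + \bar a_{mj}\mathbf{a}_m.$    (3.9)

The tableau can be interpreted as giving the representations of the vectors $\mathbf{a}_j$
in terms of the basis; the $j$th column of the tableau is the representation for the
vector $\mathbf{a}_j$. In particular, the expression for $\mathbf{b}$ in terms of the basis is given in the last
column.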

We now consider the operation of replacing one member of the basis by another
vector not already in the basis. Suppose for example we wish to replace the basis
vector $\mathbf{a}_p$, $1 \leq p \leq m$, by the vector $\mathbf{a}_q$. Provided that the first $m$ vectors with $\mathbf{a}_p$
replaced by $\mathbf{a}_q$ are linearly independent these vectors constitute a basis and every
vector can be expressed as a linear combination of this new basis. To find the new
representations of the vectors we must update the tableau. The linear independence
condition holds if and only if $\bar a_{pq} \neq 0$.

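Any vector $\mathbf{a}_j$ can be expressed in terms of the old array through (3.9). For $\mathbf{a}_q$ we
have

    $\mathbf{a}_q = \sum_{i=1,\, i \neq p}^{m} \bar a_{iq}\mathbf{a}_i + \bar a_{pq}\mathbf{a}_p,$

from which we may solve for $\mathbf{a}_p$,

    $\mathbf{a}_p = \dfrac{1}{\bar a_{pq}}\mathbf{a}_q - \sum_{i=1,\, i \neq p}^{m} \dfrac{\bar a_{iq}}{\bar a_{pq}}\mathbf{a}_i.$    (3.10)

Substituting (3.10) into (3.9) we obtain:

    $\mathbf{a}_j = \sum_{i=1,\, i \neq p}^{m} \left( \bar a_{ij} - \dfrac{\bar a_{iq}}{\bar a_{pq}}\,\bar a_{pj} \right)\mathbf{a}_i + \dfrac{\bar a_{pj}}{\bar a_{pq}}\mathbf{a}_q.$    (3.11)

Denoting the coefficients of the new tableau, which give the linear combinations,
by $\bar a'_{ij}$ we obtain immediately from (3.11)

    $\bar a'_{ij} = \bar a_{ij} - \dfrac{\bar a_{iq}}{\bar a_{pq}}\,\bar a_{pj}, \quad i \neq p$
                                                                (3.12)
    $\bar a'_{pj} = \dfrac{\bar a_{pj}}{\bar a_{pq}}.$

These formulas are identical to (3.6).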

If  a  system  of  equations  is  not  originally  given  in  canonical  form,  we  may  put 
it  into  canonical  form  by  adjoining  the  m  unit  vectors  to  the  tableau  and,  starting 
with  these  vectors  as  the  basis,  successively  replace  each  of  them  with  columns  of 
A  using  the  pivot  operation. 

Example 2. Suppose we wish to solve the simultaneous equations

    $x_1 + x_2 - x_3 = 5$
    $2x_1 - 3x_2 + x_3 = 3$
    $-x_1 + 2x_2 - x_3 = -1.$

To obtain an original basis, we form the augmented tableau

      e1      e2      e3      a1      a2      a3       b
       1       0       0       1       1      -1       5
       0       1       0       2      -3       1       3
       0       0       1      -1       2      -1      -1

and replace $\mathbf{e}_1$ by $\mathbf{a}_1$, $\mathbf{e}_2$ by $\mathbf{a}_2$, and $\mathbf{e}_3$ by $\mathbf{a}_3$. The required operations are identical
to those of Example 1.
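The procedure of adjoining unit vectors and pivoting them out can be sketched in a few lines. This is an illustration under assumptions, not a general solver: the helper `pivot` is defined here for the purpose, and the loop takes the pivots in the natural order and assumes each pivot element is nonzero, which holds for the data of Example 2.

```python
from fractions import Fraction


def pivot(tableau, p, q):
    """Pivot on row p, column q (0-based) and return the new tableau."""
    pivot_element = Fraction(tableau[p][q])
    if pivot_element == 0:
        raise ValueError("pivot element must be nonzero")
    pivot_row = [Fraction(v) / pivot_element for v in tableau[p]]
    return [
        pivot_row if i == p
        else [Fraction(v) - row[q] * w for v, w in zip(row, pivot_row)]
        for i, row in enumerate(tableau)
    ]


# Equations of Example 2.
A = [[1, 1, -1], [2, -3, 1], [-1, 2, -1]]
b = [5, 3, -1]
m = len(A)

# Augmented tableau [e1 e2 e3 | a1 a2 a3 | b].
tableau = [[int(i == k) for k in range(m)] + A[i] + [b[i]] for i in range(m)]

# Replace e_k by a_k for each k: pivot on row k, column m + k.
for k in range(m):
    tableau = pivot(tableau, k, m + k)

x = [row[-1] for row in tableau]
residual = [sum(A[i][j] * x[j] for j in range(m)) - b[i] for i in range(m)]
print("solution:", [str(v) for v in x])
print("residual:", [str(v) for v in residual])
```

The residual check substitutes the computed values back into the original equations, which guards against an error in the pivot sequence.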

3.2  Adjacent  Extreme  Points 

In Chap. 2 it was discovered that it is only necessary to consider basic feasible
solutions to the system

    $\mathbf{A}\mathbf{x} = \mathbf{b}, \quad \mathbf{x} \geq \mathbf{0}$    (3.13)

when solving a linear program, and in the previous section it was demonstrated that
the pivot operation can generate a new basic solution from an old one by replacing
one basic variable by a nonbasic variable. It is clear, however, that although the pivot
operation takes one basic solution into another, the nonnegativity of the solution will
not in general be preserved. Special conditions must be satisfied in order that a pivot
operation maintain feasibility. In this section we show how it is possible to select
pivots so that we may transfer from one basic feasible solution to another.

We show that although it is not possible to arbitrarily specify the pair of variables
whose roles are to be interchanged and expect to maintain the nonnegativity
condition, it is possible to arbitrarily specify which nonbasic variable is to become
basic and then determine which basic variable should become nonbasic. As is
conventional, we base our derivation on the vector interpretation of the linear equations
although the dual interpretation could alternatively be used.


Nondegeneracy  Assumption 

Many arguments in linear programming are substantially simplified upon the
introduction of the following.

Nondegeneracy  Assumption:  Every  basic  feasible  solution  of  (3.13)  is  a  nondegenerate 
basic  feasible  solution. 

This  assumption  is  invoked  throughout  our  development  of  the  simplex  method, 
since  when  it  does  not  hold  the  simplex  method  can  break  down  if  it  is  not  suitably 
amended.  The  assumption,  however,  should  be  regarded  as  one  made  primarily  for 
convenience,  since  all  arguments  can  be  extended  to  include  degeneracy,  and  the 
simplex  method  itself  can  be  easily  modified  to  account  for  it. 


Determination  of  Vector  to  Leave  Basis 

Suppose we have the basic feasible solution $\mathbf{x} = (x_1, x_2, \ldots, x_m, 0, 0, \ldots, 0)$ or,
equivalently, the representation

    $x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \cdots + x_m\mathbf{a}_m = \mathbf{b}.$    (3.14)

Under the nondegeneracy assumption, $x_i > 0$, $i = 1, 2, \ldots, m$. Suppose also that
we have decided to bring into the representation the vector $\mathbf{a}_q$, $q > m$. We have
available a representation of $\mathbf{a}_q$ in terms of the current basis

    $\mathbf{a}_q = \bar a_{1q}\mathbf{a}_1 + \bar a_{2q}\mathbf{a}_2 + \cdots + \bar a_{mq}\mathbf{a}_m.$    (3.15)

Multiplying (3.15) by a variable $\varepsilon \geq 0$ and subtracting from (3.14), we have

    $(x_1 - \varepsilon\bar a_{1q})\mathbf{a}_1 + (x_2 - \varepsilon\bar a_{2q})\mathbf{a}_2 + \cdots + (x_m - \varepsilon\bar a_{mq})\mathbf{a}_m + \varepsilon\mathbf{a}_q = \mathbf{b}.$    (3.16)

Thus, for any $\varepsilon \geq 0$ (3.16) gives $\mathbf{b}$ as a linear combination of at most $m + 1$ vectors.
For $\varepsilon = 0$ we have the old basic feasible solution. As $\varepsilon$ is increased from zero,
the coefficient of $\mathbf{a}_q$ increases, and it is clear that for small enough $\varepsilon$, (3.16) gives
a feasible but nonbasic solution. The coefficients of the other vectors will either
increase or decrease linearly as $\varepsilon$ is increased. If any decrease, we may set $\varepsilon$ equal
to the value corresponding to the first place where one (or more) of the coefficients
vanishes. That is

    $\varepsilon = \min_i \{\, x_i/\bar a_{iq} : \bar a_{iq} > 0 \,\}.$    (3.17)

In this case we have a new basic feasible solution, with the vector $\mathbf{a}_q$ replacing the
vector $\mathbf{a}_p$, where $p$ corresponds to the minimizing index in (3.17). If the minimum in
(3.17) is achieved by more than a single index $i$, then the new solution is degenerate
and any of the vectors with zero component can be regarded as the one that left the
basis.

If none of the $\bar a_{iq}$'s are positive, then all coefficients in the representation (3.16)
increase (or remain constant) as $\varepsilon$ is increased, and no new basic feasible solution is
obtained. We observe, however, that in this case, where none of the $\bar a_{iq}$'s are positive,
there are feasible solutions to (3.13) having arbitrarily large coefficients. This means
that the set $K$ of feasible solutions to (3.13) is unbounded, and this special case, as
we shall see, is of special significance in the simplex procedure.

In summary, we have deduced that, given a basic feasible solution and an arbitrary
vector $\mathbf{a}_q$, there is either a new basic feasible solution having $\mathbf{a}_q$ in its basis and
one of the original vectors removed, or the set of feasible solutions is unbounded.

Let us consider how the calculation of this section can be displayed in our tableau. We assume that corresponding to the constraints

$$\mathbf{A}\mathbf{x} = \mathbf{b}, \quad \mathbf{x} \geq \mathbf{0},$$

we have a tableau of the form (3.8). Note that the tableau may be the result of several pivot operations applied to the original tableau, but in any event, it represents a solution with basis $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_m$. We assume that $\bar{a}_{10}, \bar{a}_{20}, \ldots, \bar{a}_{m0}$ are nonnegative, so that the corresponding basic solution $x_1 = \bar{a}_{10}, x_2 = \bar{a}_{20}, \ldots, x_m = \bar{a}_{m0}$ is feasible. We wish to bring into the basis the vector $\mathbf{a}_q$, $q > m$, and maintain feasibility. In order to determine which element in the $q$th column to use as the pivot (and hence which vector in the basis will leave), we use (3.17) and compute the ratios $x_i/\bar{a}_{iq} = \bar{a}_{i0}/\bar{a}_{iq}$, $i = 1, 2, \ldots, m$, select the smallest nonnegative ratio (considering only rows with $\bar{a}_{iq} > 0$, as (3.17) requires), and pivot on the corresponding $\bar{a}_{iq}$.
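The ratio test just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `ratio_test` and the 0-indexed row convention are not from the text, and exact rational arithmetic is used so that ties and zeros are detected reliably.

```python
from fractions import Fraction

def ratio_test(column, rhs):
    """Return (p, ratio) for the leaving row, or None if no entry is positive.

    column: entries a_iq of the entering column q (hypothetical argument name)
    rhs:    entries a_i0 of the last column
    Rows are 0-indexed here, unlike the 1-indexed rows in the text.
    """
    best = None
    for i, (a_iq, a_i0) in enumerate(zip(column, rhs)):
        if a_iq > 0:  # only positive entries are eligible, as in (3.17)
            ratio = Fraction(a_i0) / Fraction(a_iq)
            if best is None or ratio < best[1]:
                best = (i, ratio)
    return best

# Entering column (2, 1, -1) with right-hand side (4, 3, 1):
print(ratio_test([2, 1, -1], [4, 3, 1]))   # (0, Fraction(2, 1))
# No positive entry: the feasible set is unbounded along this column.
print(ratio_test([-1, 0], [1, 2]))         # None
```

When several rows tie for the minimum, this sketch keeps the first one; any of the tied rows would be an acceptable pivot row.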


Example 3. Consider the system

      a1     a2     a3     a4     a5     a6      b
       1      0      0      2      4      6      4
       0      1      0      1      2      3      3
       0      0      1     -1      2      1      1

which has basis $\mathbf{a}_1, \mathbf{a}_2, \mathbf{a}_3$ yielding a basic feasible solution $\mathbf{x} = (4, 3, 1, 0, 0, 0)$. Suppose we elect to bring $\mathbf{a}_4$ into the basis. To determine which element in the fourth column is the appropriate pivot, we compute the three ratios:

    4/2 = 2,   3/1 = 3,   1/(-1) = -1

and select the smallest nonnegative one. This gives 2 as the pivot element. The new tableau is

      a1     a2     a3     a4     a5     a6      b
     1/2      0      0      1      2      3      2
    -1/2      1      0      0      0      0      1
     1/2      0      1      0      4      4      3

with corresponding basic feasible solution $\mathbf{x} = (0, 1, 3, 2, 0, 0)$.
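The pivot in Example 3 can be reproduced with a short Python sketch. The helper name `pivot` and the list-of-rows layout are assumptions made for illustration; rows and columns are 0-indexed, so pivoting on the 2 in the first row of column $\mathbf{a}_4$ is `pivot(T, 0, 3)`.

```python
from fractions import Fraction

def pivot(tableau, p, q):
    """Return a new tableau after pivoting on row p, column q (0-indexed)."""
    T = [[Fraction(v) for v in row] for row in tableau]
    piv = T[p][q]
    if piv == 0:
        raise ValueError("pivot element must be nonzero")
    T[p] = [v / piv for v in T[p]]             # scale the pivot row to get a 1
    for i in range(len(T)):
        if i != p and T[i][q] != 0:
            factor = T[i][q]                   # clear column q in the other rows
            T[i] = [v - factor * w for v, w in zip(T[i], T[p])]
    return T

example3 = [
    [1, 0, 0,  2, 4, 6, 4],
    [0, 1, 0,  1, 2, 3, 3],
    [0, 0, 1, -1, 2, 1, 1],
]
new = pivot(example3, 0, 3)
for row in new:
    print([str(v) for v in row])
```

The last column of the result is (2, 1, 3), which gives the basic feasible solution (0, 1, 3, 2, 0, 0) stated above.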


Our derivation of the method for selecting the pivot in a given column that will yield a new feasible solution has been based on the vector interpretation of the equation $\mathbf{A}\mathbf{x} = \mathbf{b}$. An alternative derivation can be constructed by considering the dual approach that is based on the rows of the tableau rather than the columns. Briefly, the argument runs like this: if we decide to pivot on $\bar{a}_{pq}$, then we first divide the $p$th row by the pivot element $\bar{a}_{pq}$ to change it to unity. In order that the new $\bar{a}_{p0}$ remain positive, it is clear that we must have $\bar{a}_{pq} > 0$. Next we subtract multiples of the $p$th row from each other row in order to obtain zeros in the $q$th column. In this process the new elements in the last column must remain nonnegative, if the pivot was properly selected. The full operation is to subtract, from the $i$th row, $\bar{a}_{iq}/\bar{a}_{pq}$ times the $p$th row. This yields a new solution obtained directly from the last column:

$$x_i' = x_i - \frac{\bar{a}_{iq}}{\bar{a}_{pq}}\, x_p.$$

For this to remain nonnegative, it follows that $x_p/\bar{a}_{pq} \leq x_i/\bar{a}_{iq}$ for every $i$ with $\bar{a}_{iq} > 0$, and hence again we are led to the conclusion that we select $p$ as the index $i$ minimizing $x_i/\bar{a}_{iq}$.


Geometrical  Interpretations 

Corresponding to the two interpretations of pivoting and extreme points developed
algebraically are two geometrical interpretations. The first is in activity space, the
space  where  x  is  represented.  This  is  perhaps  the  most  natural  space  to  consider,  and 
it  was  used  in  Sect.  2.5.  Here  the  feasible  region  is  shown  directly  as  a  convex  set, 
and  basic  feasible  solutions  are  extreme  points.  Adjacent  extreme  points  are  points 
that  lie  on  a  common  edge. 

The second geometrical interpretation is in requirements space, the space where the columns of $\mathbf{A}$ and $\mathbf{b}$ are represented. The fundamental relation is

$$\mathbf{a}_1 x_1 + \mathbf{a}_2 x_2 + \cdots + \mathbf{a}_n x_n = \mathbf{b}.$$

Fig. 3.1 Constraint representation in requirements space

An example for $m = 2$, $n = 4$ is shown in Fig. 3.1. A feasible solution defines a representation of $\mathbf{b}$ as a positive combination of the $\mathbf{a}_i$'s. A basic feasible solution will use only $m$ positive weights. In the figure a basic feasible solution can be constructed with positive weights on $\mathbf{a}_1$ and $\mathbf{a}_2$ because $\mathbf{b}$ lies between them. A basic feasible solution cannot be constructed with positive weights on $\mathbf{a}_1$ and $\mathbf{a}_4$. Suppose we start with $\mathbf{a}_1$ and $\mathbf{a}_2$ as the initial basis. Then an adjacent basis is found by bringing in some other vector. If $\mathbf{a}_3$ is brought in, then clearly $\mathbf{a}_2$ must go out. On the other hand, if $\mathbf{a}_4$ is brought in, $\mathbf{a}_1$ must go out.


3.3  Determining  a  Minimum  Feasible  Solution 

In  the  last  section  we  showed  how  it  is  possible  to  pivot  from  one  basic  feasible 
solution  to  another  (or  determine  that  the  solution  set  is  unbounded)  by  arbitrarily 
selecting  a  column  to  pivot  on  and  then  appropriately  selecting  the  pivot  in  that 
column.  The  idea  of  the  simplex  method  is  to  select  the  column  so  that  the  resulting 
new  basic  feasible  solution  will  yield  a  lower  value  to  the  objective  function  than 
the  previous  one.  This  then  provides  the  final  link  in  the  simplex  procedure.  By  an 
elementary  calculation,  which  is  derived  below,  it  is  possible  to  determine  which 
vector  should  enter  the  basis  so  that  the  objective  value  is  reduced,  and  by  another 
simple  calculation,  derived  in  the  previous  section,  it  is  possible  to  then  determine 
which  vector  should  leave  in  order  to  maintain  feasibility. 

Suppose  we  have  a  basic  feasible  solution 

$$(\mathbf{x}_B, \mathbf{0}) = (\bar{a}_{10}, \bar{a}_{20}, \ldots, \bar{a}_{m0}, 0, 0, \ldots, 0)$$

together with a tableau having an identity matrix appearing in the first $m$ columns as shown in tableau (3.8). The value of the objective function corresponding to any solution $\mathbf{x}$ is

$$z = c_1 x_1 + c_2 x_2 + \cdots + c_n x_n, \tag{3.18}$$

and hence for the basic solution, the corresponding value is

$$z_0 = \mathbf{c}_B^T \mathbf{x}_B, \tag{3.19}$$

where $\mathbf{c}_B^T = [c_1, c_2, \ldots, c_m]$.

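Although it is natural to use the basic solution $(\mathbf{x}_B, \mathbf{0})$ when we have the tableau (3.8), it is clear that if arbitrary values are assigned to $x_{m+1}, x_{m+2}, \ldots, x_n$, we can easily solve for the remaining variables as

$$
\begin{aligned}
x_1 &= \bar{a}_{10} - \sum_{j=m+1}^{n} \bar{a}_{1j} x_j \\
x_2 &= \bar{a}_{20} - \sum_{j=m+1}^{n} \bar{a}_{2j} x_j \\
&\;\;\vdots \\
x_m &= \bar{a}_{m0} - \sum_{j=m+1}^{n} \bar{a}_{mj} x_j.
\end{aligned}
\tag{3.20}
$$

Using (3.20) we may eliminate $x_1, x_2, \ldots, x_m$ from the general formula (3.18). Doing this we obtain

$$
\begin{aligned}
z = \mathbf{c}^T\mathbf{x} = z_0 &+ (c_{m+1} - z_{m+1})x_{m+1} \\
&+ (c_{m+2} - z_{m+2})x_{m+2} + \cdots + (c_n - z_n)x_n
\end{aligned}
\tag{3.21}
$$

where

$$z_j = \bar{a}_{1j}c_1 + \bar{a}_{2j}c_2 + \cdots + \bar{a}_{mj}c_m, \quad m+1 \leq j \leq n, \tag{3.22}$$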

which is the fundamental relation required to determine the pivot column. The important point is that this equation gives the values of the objective function $z$ for any solution of $\mathbf{A}\mathbf{x} = \mathbf{b}$ in terms of the variables $x_{m+1}, \ldots, x_n$. From it we can determine if there is any advantage in changing the basic solution by introducing one of the nonbasic variables. For example, if $c_j - z_j$ is negative for some $j$, $m+1 \leq j \leq n$, then increasing $x_j$ from zero to some positive value would decrease the total cost, and therefore would yield a better solution. Formulas (3.21) and (3.22) automatically take into account the changes that would be required in the values of the basic variables $x_1, x_2, \ldots, x_m$ to accommodate the change in $x_j$.
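As a numerical illustration of (3.21) and (3.22), the following Python sketch computes the quantities $c_j - z_j$ for a canonical tableau whose first $m$ columns are basic. The function name `reduced_costs` and the cost vector (1, 2, 3, 4, 5, 6) are hypothetical choices made for this example; the constraint rows are those of Example 3 in the previous section.

```python
from fractions import Fraction

def reduced_costs(body, c, m):
    """Return (r, z0) with r[j] = c[j] - z[j] as in (3.22).

    body: m rows [a_i1, ..., a_in, a_i0] in canonical form, with the first
          m columns forming the identity (0-indexed in the code).
    c:    cost vector of length n.
    """
    n = len(c)
    z = [sum(Fraction(c[i]) * body[i][j] for i in range(m)) for j in range(n)]
    z0 = sum(Fraction(c[i]) * body[i][n] for i in range(m))
    return [Fraction(c[j]) - z[j] for j in range(n)], z0

body = [
    [1, 0, 0,  2, 4, 6, 4],
    [0, 1, 0,  1, 2, 3, 3],
    [0, 0, 1, -1, 2, 1, 1],
]
c = [1, 2, 3, 4, 5, 6]          # hypothetical costs, not from the text
r, z0 = reduced_costs(body, c, m=3)
print([str(v) for v in r], z0)  # ['0', '0', '0', '3', '-9', '-9'] 13

# Check (3.21) at a nonbasic point: raise x5 to 1/4 and solve (3.20).
x5 = Fraction(1, 4)
x = [body[i][6] - body[i][4] * x5 for i in range(3)] + [0, x5, 0]
cost = sum(ci * xi for ci, xi in zip(c, x))
print(cost == z0 + r[4] * x5)   # True
```

Here $c_5 - z_5 = -9$ is negative, so raising $x_5$ lowers the cost, in agreement with the discussion above.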

Let us derive these relations from a different viewpoint. Let $\bar{\mathbf{a}}_j$ be the $j$th column of the tableau. Then any solution satisfies

$$x_1\mathbf{e}_1 + x_2\mathbf{e}_2 + \cdots + x_m\mathbf{e}_m = \bar{\mathbf{a}}_0 - x_{m+1}\bar{\mathbf{a}}_{m+1} - x_{m+2}\bar{\mathbf{a}}_{m+2} - \cdots - x_n\bar{\mathbf{a}}_n.$$

Taking the inner product of this vector equation with $\mathbf{c}_B^T$, we have

$$\sum_{i=1}^{m} c_i x_i = \mathbf{c}_B^T\bar{\mathbf{a}}_0 - \sum_{j=m+1}^{n} z_j x_j,$$

where $z_j = \mathbf{c}_B^T\bar{\mathbf{a}}_j$. Thus, adding $\sum_{j=m+1}^{n} c_j x_j$ to both sides,

$$\mathbf{c}^T\mathbf{x} = z_0 + \sum_{j=m+1}^{n} (c_j - z_j)x_j \tag{3.23}$$

as before.

We  now  state  the  condition  for  improvement,  which  follows  easily  from  the 
above  observation,  as  a  theorem. 

Theorem (Improvement of Basic Feasible Solution). Given a nondegenerate basic feasible solution with corresponding objective value $z_0$, suppose that for some $j$ there holds $c_j - z_j < 0$. Then there is a feasible solution with objective value $z < z_0$. If the column $\mathbf{a}_j$ can be substituted for some vector in the original basis to yield a new basic feasible solution, this new solution will have $z < z_0$. If $\mathbf{a}_j$ cannot be substituted to yield a basic feasible solution, then the solution set $K$ is unbounded and the objective function can be made arbitrarily small (toward minus infinity).

Proof. The result is an immediate consequence of the previous discussion. Let $(x_1, x_2, \ldots, x_m, 0, 0, \ldots, 0)$ be the basic feasible solution with objective value $z_0$ and suppose $c_{m+1} - z_{m+1} < 0$. Then, in any case, new feasible solutions can be constructed of the form $(x_1', x_2', \ldots, x_m', x_{m+1}', 0, 0, \ldots, 0)$ with $x_{m+1}' > 0$. Substituting this solution in (3.21) we have

$$z - z_0 = (c_{m+1} - z_{m+1})x_{m+1}' < 0,$$

and hence $z < z_0$ for any such solution. It is clear that we desire to make $x_{m+1}'$ as large as possible. As $x_{m+1}'$ is increased, the other components increase, remain constant, or decrease. Thus $x_{m+1}'$ can be increased until one $x_i' = 0$, $i \leq m$, in which case we obtain a new basic feasible solution, or if none of the $x_i'$'s decrease, $x_{m+1}'$ can be increased without bound indicating an unbounded solution set and an objective value without lower bound. ∎

We see that if at any stage $c_j - z_j < 0$ for some $j$, it is possible to make $x_j$ positive and decrease the objective function. The final question remaining is whether $c_j - z_j \geq 0$ for all $j$ implies optimality.

Optimality Condition Theorem. If for some basic feasible solution $c_j - z_j \geq 0$ for all $j$, then that solution is optimal.

Proof. This follows immediately from (3.21), since any other feasible solution must have $x_i \geq 0$ for all $i$, and hence the value $z$ of the objective will satisfy $z - z_0 \geq 0$. ∎

Since the constants $c_j - z_j$ play such a central role in the development of the simplex method, it is convenient to introduce the somewhat abbreviated notation $r_j = c_j - z_j$ and refer to the $r_j$'s as the relative cost coefficients or, alternatively, the reduced cost coefficients (both terms occur in common usage). These coefficients measure the cost of a variable relative to a given basis. (For notational convenience we extend the definition of relative cost coefficients to basic variables as well; the relative cost coefficient of a basic variable is zero.)

We conclude this section by giving an economic interpretation of the relative cost coefficients. Let us agree to interpret the linear program

    minimize    $\mathbf{c}^T\mathbf{x}$
    subject to  $\mathbf{A}\mathbf{x} = \mathbf{b}, \quad \mathbf{x} \geq \mathbf{0}$

as  a  diet  problem  (see  Sect.  2.2)  where  the  nutritional  requirements  must  be  met 
exactly.  A  column  of  A  gives  the  nutritional  equivalent  of  a  unit  of  a  particular  food. 
With  a  given  basis  consisting  of,  say,  the  first  m  columns  of  A,  the  corresponding 
simplex  tableau  shows  how  any  food  (or  more  precisely,  the  nutritional  content  of 
any  food)  can  be  constructed  as  a  combination  of  foods  in  the  basis.  For  instance, 
if  carrots  are  not  in  the  basis  we  can,  using  the  description  given  by  the  tableau, 
construct a synthetic carrot which is nutritionally equivalent to a carrot, by an
appropriate combination of the foods in the basis.

In  considering  whether  or  not  the  solution  represented  by  the  current  basis  is 
optimal,  we  consider  a  certain  food  not  in  the  basis — say  carrots — and  determine  if 
it  would  be  advantageous  to  bring  it  into  the  basis.  This  is  very  easily  determined 
by  examining  the  cost  of  carrots  as  compared  with  the  cost  of  synthetic  carrots.  If 
carrots are food $j$, then the unit cost of carrots is $c_j$. The cost of a unit of synthetic carrots is, on the other hand,

$$z_j = \sum_{i=1}^{m} c_i \bar{a}_{ij}.$$

If $r_j = c_j - z_j < 0$, it is advantageous to use real carrots in place of synthetic carrots, and carrots should be brought into the basis.

In general each $z_j$ can be thought of as the price of a unit of the column $\mathbf{a}_j$ when constructed from the current basis. The difference between this synthetic price and the direct price of that column determines whether that column should enter the basis.


3.4  Computational  Procedure:  Simplex  Method 

In  previous  sections  the  theory,  and  indeed  much  of  the  technique,  necessary  for 
the  detailed  development  of  the  simplex  method  has  been  established.  It  is  only 
necessary  to  put  it  all  together  and  illustrate  it  with  examples. 

In  this  section  we  assume  that  we  begin  with  a  basic  feasible  solution  and  that  the 
tableau  corresponding  to  Ax  =  b  is  in  the  canonical  form  for  this  solution.  Methods 
for  obtaining  this  first  basic  feasible  solution,  when  one  is  not  obvious,  are  described 
in  the  next  section. 

In addition to beginning with the array $\mathbf{A}\mathbf{x} = \mathbf{b}$ expressed in canonical form corresponding to a basic feasible solution, we append a row at the bottom consisting of the relative cost coefficients and the negative of the current cost. The result is a simplex tableau.

Thus, if we assume the basic variables are (in order) $x_1, x_2, \ldots, x_m$, the simplex tableau takes the initial form shown in Fig. 3.2.

The basic solution corresponding to this tableau is

$$x_i = \begin{cases} \bar{a}_{i0}, & 1 \leq i \leq m \\ 0, & m+1 \leq i \leq n \end{cases}$$

which we have assumed is feasible, that is, $\bar{a}_{i0} \geq 0$, $i = 1, 2, \ldots, m$. The corresponding value of the objective function is $z_0$.


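$$
\begin{array}{cccccccccc|c}
\mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_m & \mathbf{a}_{m+1} & \mathbf{a}_{m+2} & \cdots & \mathbf{a}_j & \cdots & \mathbf{a}_n & \mathbf{b} \\ \hline
1 & 0 & \cdots & 0 & \bar{a}_{1(m+1)} & \bar{a}_{1(m+2)} & \cdots & \bar{a}_{1j} & \cdots & \bar{a}_{1n} & \bar{a}_{10} \\
0 & 1 & \cdots & 0 & \bar{a}_{2(m+1)} & \bar{a}_{2(m+2)} & \cdots & \bar{a}_{2j} & \cdots & \bar{a}_{2n} & \bar{a}_{20} \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
0 & 0 & \cdots & 0 & \bar{a}_{i(m+1)} & \bar{a}_{i(m+2)} & \cdots & \bar{a}_{ij} & \cdots & \bar{a}_{in} & \bar{a}_{i0} \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
0 & 0 & \cdots & 1 & \bar{a}_{m(m+1)} & \bar{a}_{m(m+2)} & \cdots & \bar{a}_{mj} & \cdots & \bar{a}_{mn} & \bar{a}_{m0} \\ \hline
0 & 0 & \cdots & 0 & r_{m+1} & r_{m+2} & \cdots & r_j & \cdots & r_n & -z_0
\end{array}
$$

Fig. 3.2 Canonical simplex tableau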


The relative cost coefficients $r_j$ indicate whether the value of the objective will increase or decrease if $x_j$ is pivoted into the solution. If these coefficients are all nonnegative, then the indicated solution is optimal. If some of them are negative, an improvement can be made (assuming nondegeneracy) by bringing the corresponding component into the solution. When more than one of the relative cost coefficients is negative, any one of them may be selected to determine in which column to pivot. Common practice is to select the most negative value. (See Exercise 13 for further discussion of this point.)

Some more discussion of the relative cost coefficients and the last row of the tableau is warranted. We may regard $z$ as an additional variable and

$$c_1 x_1 + c_2 x_2 + \cdots + c_n x_n - z = 0$$

as another equation. A basic solution to the augmented system will have $m + 1$ basic variables, but we can require that $z$ be one of them. For this reason it is not necessary to add a column corresponding to $z$, since it would always be $(0, 0, \ldots, 0, 1)$. Thus, initially, a last row consisting of the $c_j$'s and a right-hand side of zero can be appended to the standard array to represent this additional equation. Using standard pivot operations, the elements in this row corresponding to basic variables can be reduced to zero. This is equivalent to transforming the additional equation to the form

$$r_{m+1}x_{m+1} + r_{m+2}x_{m+2} + \cdots + r_n x_n - z = -z_0. \tag{3.24}$$

This must be equivalent to (3.23), and hence the $r_j$'s obtained are the relative cost coefficients. Thus, the last row can be treated operationally like any other row: just start with the $c_j$'s and reduce the terms corresponding to basic variables to zero by row operations.

After a column $q$ is selected in which to pivot, the final selection of the pivot element is made by computing the ratio $\bar{a}_{i0}/\bar{a}_{iq}$ for the positive elements $\bar{a}_{iq}$, $i = 1, 2, \ldots, m$, of the $q$th column and selecting the element $p$ yielding the minimum ratio. Pivoting on this element will maintain feasibility as well as (assuming nondegeneracy) decrease the value of the objective function. If there are ties, any element yielding the minimum can be used. If there are no positive elements in the column, the problem is unbounded. After updating the entire tableau with $\bar{a}_{pq}$ as pivot and transforming the last row in the same manner as all other rows (except the pivot row $p$), we obtain a new tableau in canonical form. The negative of the new value of the objective function again appears in the lower right-hand corner of the tableau.

The  simplex  algorithm  can  be  summarized  by  the  following  steps: 

Step 0. Form a tableau as in Fig. 3.2 corresponding to a basic feasible solution. The relative cost coefficients can be found by row reduction.

Step 1. If each $r_j \geq 0$, stop; the current basic feasible solution is optimal.

Step 2. Select $q$ such that $r_q < 0$ to determine which nonbasic variable is to become basic.

Step 3. Calculate the ratios $\bar{a}_{i0}/\bar{a}_{iq}$ for $\bar{a}_{iq} > 0$, $i = 1, 2, \ldots, m$. If no $\bar{a}_{iq} > 0$, stop; the problem is unbounded. Otherwise, select $p$ as the index $i$ corresponding to the minimum ratio.

Step 4. Pivot on the $pq$th element, updating all rows including the last. Return to Step 1.

Proof  that  the  algorithm  solves  the  problem  (again  assuming  nondegeneracy)  is 
essentially  established  by  our  previous  development.  The  process  terminates  only 
if  optimality  is  achieved  or  unboundedness  is  discovered.  If  neither  condition  is 
discovered  at  a  given  basic  solution,  then  the  objective  is  strictly  decreased.  Since 
there  are  only  a  finite  number  of  possible  basic  feasible  solutions,  and  no  basis 
repeats  because  of  the  strictly  decreasing  objective,  the  algorithm  must  reach  a  basis 
satisfying  one  of  the  two  terminating  conditions. 

Example 1. Maximize $3x_1 + x_2 + 3x_3$ subject to

$$
\begin{aligned}
2x_1 + x_2 + x_3 &\leq 2 \\
x_1 + 2x_2 + 3x_3 &\leq 5 \\
2x_1 + 2x_2 + x_3 &\leq 6 \\
x_1 \geq 0, \quad x_2 \geq 0, \quad x_3 &\geq 0.
\end{aligned}
$$


To transform the problem into standard form so that the simplex procedure can be applied, we change the maximization to minimization by multiplying the objective function by minus one, and introduce three nonnegative slack variables $x_4, x_5, x_6$. We then have the initial tableau

             a1     a2     a3     a4     a5     a6      b
            (2)    (1)      1      1      0      0      2
              1      2    (3)      0      1      0      5
              2      2      1      0      0      1      6
    r^T      -3     -1     -3      0      0      0      0

    First tableau

The problem is already in canonical form with the three slack variables serving as the basic variables. We have at this point $r_j = c_j - z_j = c_j$, since the costs of the slacks are zero. Application of the criterion for selecting a column in which to pivot shows that any of the first three columns would yield an improved solution. In each of these columns the appropriate pivot element is determined by computing the ratios $\bar{a}_{i0}/\bar{a}_{ij}$ and selecting the smallest positive one. The three allowable pivots are all circled on the tableau (shown here in parentheses). It is only necessary to determine one allowable pivot, and normally we would not bother to calculate them all. For hand calculation on problems of this size, however, we may wish to examine the allowable pivots and select one that will minimize (at least in the short run) the amount of division required. Thus for this example we select the pivot in the second column, which results in:

              2      1      1      1      0      0      2
             -3      0    (1)     -2      1      0      1
             -2      0     -1     -2      0      1      2
             -1      0     -2      1      0      0      2

    Second tableau

We note that the objective function (we are using the negative of the original one) has decreased from zero to minus two. We now pivot on the circled element 1 in the third column.

            (5)      1      0      3     -1      0      1
             -3      0      1     -2      1      0      1
             -5      0      0     -4      1      1      3
             -7      0      0     -3      2      0      4

    Third tableau

The value of the objective function has now decreased to minus four and we may pivot in either the first or fourth column. We select the circled element 5 in the first column.

              1    1/5      0    3/5   -1/5      0    1/5
              0    3/5      1   -1/5    2/5      0    8/5
              0      1      0     -1      0      1      4
              0    7/5      0    6/5    3/5      0   27/5

    Fourth tableau

Since the last row has no negative elements, we conclude that the solution corresponding to the fourth tableau is optimal. Thus $x_1 = 1/5$, $x_2 = 0$, $x_3 = 8/5$, $x_4 = 0$, $x_5 = 0$, $x_6 = 4$ is the optimal solution with a corresponding value of the (negative) objective of $-(27/5)$.
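Steps 1 to 4 of the algorithm can be sketched as a short Python routine and checked against Example 1. This is a minimal illustration, not a production solver: the function name `simplex`, the `basis` bookkeeping list, and the 0-indexed columns are assumptions, the entering column is chosen by the most-negative rule, and no anticycling safeguard is included. Because of the most-negative rule, the sketch pivots in a different order from the hand calculation above, but it reaches the same optimal tableau.

```python
from fractions import Fraction

def simplex(tableau, basis):
    """Run Steps 1-4 on a canonical tableau.

    tableau: m constraint rows plus a final relative-cost row; the last
             column holds the right-hand sides and, in the corner, -z0.
    basis:   basis[i] is the 0-indexed column that is basic in row i.
    Returns (final_tableau, final_basis); raises if unbounded.
    """
    T = [[Fraction(v) for v in row] for row in tableau]
    basis = list(basis)
    m, n = len(T) - 1, len(T[0]) - 1
    while True:
        q = min(range(n), key=lambda j: T[m][j])      # Step 2: most negative r_j
        if T[m][q] >= 0:                              # Step 1: all r_j >= 0
            return T, basis
        rows = [i for i in range(m) if T[i][q] > 0]   # Step 3: ratio test
        if not rows:
            raise ValueError("problem is unbounded")
        p = min(rows, key=lambda i: T[i][n] / T[i][q])
        piv = T[p][q]                                 # Step 4: pivot on (p, q)
        T[p] = [v / piv for v in T[p]]
        for i in range(m + 1):
            if i != p and T[i][q] != 0:
                f = T[i][q]
                T[i] = [v - f * w for v, w in zip(T[i], T[p])]
        basis[p] = q

first_tableau = [
    [ 2,  1,  1, 1, 0, 0, 2],
    [ 1,  2,  3, 0, 1, 0, 5],
    [ 2,  2,  1, 0, 0, 1, 6],
    [-3, -1, -3, 0, 0, 0, 0],
]
T, basis = simplex(first_tableau, basis=[3, 4, 5])
x = [Fraction(0)] * 6
for i, j in enumerate(basis):
    x[j] = T[i][-1]
print([str(v) for v in x], str(-T[-1][-1]))
# ['1/5', '0', '8/5', '0', '0', '4'] -27/5
```

The corner entry of the final tableau is 27/5, the negative of the minimized objective, which matches the fourth tableau.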


Degeneracy 

It is possible that in the course of the simplex procedure, degenerate basic feasible solutions may occur. Often they can be handled as a nondegenerate basic feasible solution. However, it is possible that after a new column $q$ is selected to enter the basis, the minimum of the ratios $\bar{a}_{i0}/\bar{a}_{iq}$ may be zero, implying that the zero-valued basic variable is the one to go out. This means that the new variable $x_q$ will come in at zero value, the objective will not decrease, and the new basic feasible solution will also be degenerate. Conceivably, this process could continue for a series of steps until, finally, the original degenerate solution is again obtained. The result is a cycle that could be repeated indefinitely.

Methods  have  been  developed  to  avoid  such  cycles  (see  Exercises  15-17  for  a 
full  discussion  of  one  of  them,  which  is  based  on  perturbing  the  problem  slightly 
so  that  zero-valued  variables  are  actually  small  positive  values,  and  Exercise  32  for 
Bland’s  rule,  which  is  simpler).  In  practice,  however,  such  procedures  are  found  to 
be  unnecessary.  When  degenerate  solutions  are  encountered,  the  simplex  procedure 
generally  does  not  enter  a  cycle.  However,  anticycling  procedures  are  simple,  and 
many  codes  incorporate  such  a  procedure  for  the  sake  of  safety. 


3.5  Finding  a  Basic  Feasible  Solution 

A  basic  feasible  solution  is  sometimes  immediately  available  for  linear  programs. 
For  example,  in  problems  with  constraints  of  the  form 

$$\mathbf{A}\mathbf{x} \leq \mathbf{b}, \quad \mathbf{x} \geq \mathbf{0} \tag{3.25}$$

with $\mathbf{b} \geq \mathbf{0}$, a basic feasible solution to the corresponding standard form of the problem is provided by the slack variables. This provides a means for initiating the simplex procedure. The example in the last section was of this type. An initial basic feasible solution is not always apparent for other types of linear programs, however, and it is necessary to develop a means for determining one so that the simplex method can be initiated. Interestingly (and fortunately), an auxiliary linear program and corresponding application of the simplex method can be used to determine the required initial solution.

By elementary straightforward operations the constraints of a linear programming problem can always be expressed in the form

$$\mathbf{A}\mathbf{x} = \mathbf{b}, \quad \mathbf{x} \geq \mathbf{0} \tag{3.26}$$

with $\mathbf{b} \geq \mathbf{0}$. In order to find a solution to (3.26) consider the artificial minimization problem

$$
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{m} u_i \\
\text{subject to} \quad & \mathbf{A}\mathbf{x} + \mathbf{u} = \mathbf{b} \\
& \mathbf{x} \geq \mathbf{0}, \quad \mathbf{u} \geq \mathbf{0}
\end{aligned}
\tag{3.27}
$$

where $\mathbf{u} = (u_1, u_2, \ldots, u_m)$ is a vector of artificial variables. If there is a feasible solution to (3.26), then it is clear that (3.27) has a minimum value of zero with $\mathbf{u} = \mathbf{0}$. If (3.26) has no feasible solution, then the minimum value of (3.27) is greater than zero.

Now (3.27) is itself a linear program in the variables $\mathbf{x}$, $\mathbf{u}$, and the system is already in canonical form with basic feasible solution $\mathbf{u} = \mathbf{b}$. If (3.27) is solved using the simplex technique, a basic feasible solution is obtained at each step. If the minimum value of (3.27) is zero, then the final basic solution will have all $u_i = 0$, and hence barring degeneracy, the final solution will have no $u_i$ variables basic. If in the final solution some $u_i$ are both zero and basic, indicating a degenerate solution, these basic variables can be exchanged for nonbasic $x_j$ variables (again at zero level) to yield a basic feasible solution involving $\mathbf{x}$ variables only. (However, the situation is more complex if $\mathbf{A}$ is not of full rank. See Exercise 21.)

Example 1. Find a basic feasible solution to

$$
\begin{aligned}
2x_1 + x_2 + 2x_3 &= 4 \\
3x_1 + 3x_2 + x_3 &= 3 \\
x_1 \geq 0, \quad x_2 \geq 0, \quad x_3 &\geq 0.
\end{aligned}
$$

We introduce artificial variables $x_4 \geq 0$, $x_5 \geq 0$ and an objective function $x_4 + x_5$. The initial tableau is

             x1     x2     x3     x4     x5      b
              2      1      2      1      0      4
              3      3      1      0      1      3
    c^T       0      0      0      1      1      0

    Initial tableau

A  basic  feasible  solution  to  the  expanded  system  is  given  by  the  artificial  variables. 
To  initiate  the  simplex  procedure  we  must  update  the  last  row  so  that  it  has  zero 
components  under  the  basic  variables.  This  yields: 

              2      1      2      1      0      4
            (3)      3      1      0      1      3
    r^T      -5     -4     -3      0      0     -7

    First tableau

Pivoting in the column having the most negative bottom row component as indicated (the circled pivot is shown in parentheses), we obtain:

              0     -1  (4/3)      1   -2/3      2
              1      1    1/3      0    1/3      1
              0      1   -4/3      0    5/3     -2

    Second tableau

In the second tableau there is only one choice for pivot, and it leads to the final tableau shown.

              0   -3/4      1    3/4   -1/2    3/2
              1    5/4      0   -1/4    1/2    1/2
              0      0      0      1      1      0

    Final tableau

Both of the artificial variables have been driven out of the basis, thus reducing the value of the objective function to zero and leading to the basic feasible solution to the original problem

$$x_1 = 1/2, \quad x_2 = 0, \quad x_3 = 3/2.$$

Using  artificial  variables,  we  attack  a  general  linear  programming  problem  by 
use  of  the  two-phase  method.  This  method  consists  simply  of  a  phase  I  in  which 
artificial  variables  are  introduced  as  above  and  a  basic  feasible  solution  is  found 
(or it is determined that no feasible solutions exist); and a phase II in which, using
the  basic  feasible  solution  resulting  from  phase  I,  the  original  objective  function 
is  minimized.  During  phase  II  the  artificial  variables  and  the  objective  function  of 
phase  I  are  omitted.  Of  course,  in  phase  I  artificial  variables  need  be  introduced  only 
in  those  equations  that  do  not  contain  slack  variables. 

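Example 2. Consider the problem

$$
\begin{aligned}
\text{minimize} \quad & 4x_1 + x_2 + x_3 \\
\text{subject to} \quad & 2x_1 + x_2 + 2x_3 = 4 \\
& 3x_1 + 3x_2 + x_3 = 3 \\
& x_1 \geq 0, \quad x_2 \geq 0, \quad x_3 \geq 0.
\end{aligned}
$$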


There  is  no  basic  feasible  solution  apparent,  so  we  use  the  two-phase  method.  The 
first  phase  was  done  in  Example  1  for  these  constraints,  so  we  shall  not  repeat  it 
here.  We  give  only  the  final  tableau  with  the  columns  corresponding  to  the  artificial 
variables  deleted,  since  they  are  not  used  in  phase  II.  We  use  the  new  cost  function 
in place of the old one. Temporarily writing $\mathbf{c}^T$ in the bottom row we have

             x1     x2     x3      b
              0   -3/4      1    3/2
              1    5/4      0    1/2
    c^T       4      1      1      0

    Initial tableau

Transforming the last row so that zeros appear in the basic columns, we have

              0   -3/4      1    3/2
              1  (5/4)      0    1/2
              0  -13/4      0   -7/2

    First tableau

            3/5      0      1    9/5
            4/5      1      0    2/5
           13/5      0      0  -11/5

    Second tableau

and hence the optimal solution is $x_1 = 0$, $x_2 = 2/5$, $x_3 = 9/5$.
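The two-phase method of Examples 1 and 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names `pivot`, `simplex`, and `two_phase` are hypothetical, $\mathbf{b} \geq \mathbf{0}$ is assumed, one artificial variable is added to every row, the most-negative rule picks the entering column, and the degenerate case in which an artificial variable remains basic at zero level is reported rather than handled.

```python
from fractions import Fraction

def pivot(T, p, q):
    """Pivot the tableau T in place on row p, column q."""
    piv = T[p][q]
    T[p] = [v / piv for v in T[p]]
    for i in range(len(T)):
        if i != p and T[i][q] != 0:
            f = T[i][q]
            T[i] = [v - f * w for v, w in zip(T[i], T[p])]

def simplex(T, basis):
    """Tableau simplex (in place); the last row holds the relative costs."""
    m, n = len(T) - 1, len(T[0]) - 1
    while True:
        q = min(range(n), key=lambda j: T[m][j])
        if T[m][q] >= 0:
            return
        rows = [i for i in range(m) if T[i][q] > 0]
        if not rows:
            raise ValueError("problem is unbounded")
        p = min(rows, key=lambda i: T[i][n] / T[i][q])
        pivot(T, p, q)
        basis[p] = q

def two_phase(A, b, c):
    m, n = len(A), len(c)
    # Phase I: columns n..n+m-1 are artificial; the cost row below is the
    # sum of the artificials already reduced against the constraint rows.
    T = [[Fraction(v) for v in A[i]]
         + [Fraction(int(i == k)) for k in range(m)]
         + [Fraction(b[i])] for i in range(m)]
    T.append([-sum(T[i][j] for i in range(m)) for j in range(n)]
             + [Fraction(0)] * m
             + [-sum(T[i][-1] for i in range(m))])
    basis = list(range(n, n + m))
    simplex(T, basis)
    if T[m][-1] != 0:
        raise ValueError("no feasible solution")
    if any(j >= n for j in basis):
        raise NotImplementedError("artificial variable still basic (degenerate)")
    # Phase II: drop artificial columns, write c in the last row, reduce it.
    T = [row[:n] + [row[-1]] for row in T[:m]]
    cost = [Fraction(v) for v in c] + [Fraction(0)]
    for i, j in enumerate(basis):
        f = cost[j]
        cost = [v - f * w for v, w in zip(cost, T[i])]
    T.append(cost)
    simplex(T, basis)
    x = [Fraction(0)] * n
    for i, j in enumerate(basis):
        x[j] = T[i][-1]
    return x, -T[m][-1]

x, z = two_phase([[2, 1, 2], [3, 3, 1]], [4, 3], [4, 1, 1])
print([str(v) for v in x], str(z))   # ['0', '2/5', '9/5'] 11/5
```

The printed solution agrees with Example 2; the optimal cost 11/5 is the negative of the corner entry of the second tableau.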
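
Example 3 (A Free Variable Problem).

$$
\begin{aligned}
\text{minimize} \quad & -2x_1 + 4x_2 + 7x_3 + x_4 + 5x_5 \\
\text{subject to} \quad & -x_1 + x_2 + 2x_3 + x_4 + 2x_5 = 7 \\
& -x_1 + 2x_2 + 3x_3 + x_4 + x_5 = 6 \\
& -x_1 + x_2 + x_3 + 2x_4 + x_5 = 4 \\
& x_1 \text{ free}, \quad x_2 \geq 0, \quad x_3 \geq 0, \quad x_4 \geq 0, \quad x_5 \geq 0.
\end{aligned}
$$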


Since $x_1$ is free, it can be eliminated, as described in Chap. 2, by solving for $x_1$ in terms of the other variables from the first equation and substituting everywhere else. This can all be done with the simplex tableau as follows:

         x1     x2     x3     x4     x5      b
       (-1)      1      2      1      2      7
         -1      2      3      1      1      6
         -1      1      1      2      1      4
         -2      4      7      1      5      0

    Initial tableau

We select any nonzero element in the first column to pivot on (here the circled element, shown in parentheses); this will eliminate $x_1$.


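         x1 |   x2     x3     x4     x5      b
          1 |   -1     -2     -1     -2     -7
        ----+----------------------------------
          0 |    1      1      0     -1     -1
          0 |    0     -1      1     -1     -3
          0 |    2      3     -1      1    -14

    Equivalent problem

We now save the first row for future reference, but our linear program only involves the sub-tableau indicated (the block below and to the right of the dividing lines). There is no obvious basic feasible solution for this problem, so we introduce artificial variables $x_6$ and $x_7$, after multiplying both constraint rows by $-1$ so that the right-hand sides are nonnegative.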


             x2     x3     x4     x5     x6     x7      b
             -1     -1      0      1      1      0      1
              0      1     -1      1      0      1      3
    c^T       0      0      0      0      1      1      0

    Initial tableau for phase I

Transforming the last row appropriately we obtain


             x2     x3     x4     x5     x6     x7      b
             -1     -1      0    (1)      1      0      1
              0      1     -1      1      0      1      3
    r^T       1      0      1     -2      0      0     -4

    First tableau (phase I)

             x2     x3     x4     x5     x6     x7      b
             -1     -1      0      1      1      0      1
            (1)      2     -1      0     -1      1      2
             -1     -2      1      0      2      0     -2

    Second tableau (phase I)

              0      1     -1      1      0      1      3
              1      2     -1      0     -1      1      2
              0      0      0      0      1      1      0

    Final tableau (phase I)


Now  we  go  back  to  the  equivalent  reduced  problem 

X2  X3  X4  x$  b 

01-11  3 

12-10  2 
cT  2  3  -1  1  -14 

Initial  tableau — phase  II 

Transforming  the  last  row  appropriately  we  proceed  with: 

             0     1    -1     1  |   3
             1    [2]   -1     0  |   2
             0    -2     2     0  | -21

First tableau, phase II (pivot element in brackets)

          -1/2     0  -1/2     1  |   2
           1/2     1  -1/2     0  |   1
             1     0     1     0  | -19

Final tableau, phase II

The solution $x_3 = 1$, $x_5 = 2$ can be inserted in the expression for $x_1$ giving

$$x_1 = -7 + 2 \cdot 1 + 2 \cdot 2 = -1;$$

thus the final solution is

$$x_1 = -1,\quad x_2 = 0,\quad x_3 = 1,\quad x_4 = 0,\quad x_5 = 2.$$
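
As a quick check, the final solution can be substituted back into the original problem. The short Python sketch below (the helper name `dot` is ours and purely illustrative) confirms that the three equality constraints hold and that the objective value is 19, in agreement with the entry -19 in the lower right corner of the final phase II tableau.

```python
# Data of the example: objective coefficients, constraint rows, right-hand sides.
c = [-2, 4, 7, 1, 5]
A = [
    [-1, 1, 2, 1, 2],
    [-1, 2, 3, 1, 1],
    [-1, 1, 1, 2, 1],
]
b = [7, 6, 4]

# Solution read from the final phase II tableau, with x1 recovered from the saved row.
x = [-1, 0, 1, 0, 2]

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

residuals = [dot(row, x) - bi for row, bi in zip(A, b)]
objective = dot(c, x)

print(residuals)  # [0, 0, 0]
print(objective)  # 19
```

Only $x_1$ is allowed to be negative here; the remaining components are nonnegative as required.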
3.6  Matrix  Form  of  the  Simplex  Method

Although the elementary pivot transformations associated with the simplex method
are in many respects most easily discernible in the tableau format, with attention
focused on the individual elements, there is much insight to be gained by studying
a matrix interpretation of the procedure. The vector-matrix relationships that exist
between the various rows and columns of the tableau lead not only to increased
understanding but also, in a rather direct way, to the revised simplex procedure,
which in many cases can result in considerable computational advantage.
The matrix formulation is also a natural setting for the discussion of dual linear
programs and other topics related to linear programming.

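A preliminary observation in the development is that the tableau at any point in
the simplex procedure can be determined solely by a knowledge of which variables
are basic. As before we denote by $\mathbf{B}$ the submatrix of the original $\mathbf{A}$ matrix
consisting of the $m$ columns of $\mathbf{A}$ corresponding to the basic variables. These columns are
linearly independent and hence the columns of $\mathbf{B}$ form a basis for $E^m$. We refer to $\mathbf{B}$
as the basis matrix.

As usual, let us assume that $\mathbf{B}$ consists of the first $m$ columns of $\mathbf{A}$. Then by
partitioning $\mathbf{A}$, $\mathbf{x}$, and $\mathbf{c}^T$ as

$$\mathbf{A} = [\mathbf{B}, \mathbf{D}], \qquad \mathbf{x} = (\mathbf{x}_B, \mathbf{x}_D), \qquad \mathbf{c}^T = [\mathbf{c}_B^T, \mathbf{c}_D^T],$$

the standard linear program becomes

$$
\begin{aligned}
\text{minimize}\quad & \mathbf{c}_B^T \mathbf{x}_B + \mathbf{c}_D^T \mathbf{x}_D \\
\text{subject to}\quad & \mathbf{B}\mathbf{x}_B + \mathbf{D}\mathbf{x}_D = \mathbf{b} \\
& \mathbf{x}_B \ge \mathbf{0},\; \mathbf{x}_D \ge \mathbf{0}.
\end{aligned}
\tag{3.28}
$$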


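The basic solution, which we assume is also feasible, corresponding to the basis
$\mathbf{B}$ is $\mathbf{x} = (\mathbf{x}_B, \mathbf{0})$ where $\mathbf{x}_B = \mathbf{B}^{-1}\mathbf{b}$. The basic solution results from setting $\mathbf{x}_D = \mathbf{0}$.
However, for any value of $\mathbf{x}_D$ the necessary value of $\mathbf{x}_B$ can be computed from
(3.28) as

$$\mathbf{x}_B = \mathbf{B}^{-1}\mathbf{b} - \mathbf{B}^{-1}\mathbf{D}\mathbf{x}_D, \tag{3.29}$$

and this general expression when substituted in the cost function yields

$$
\begin{aligned}
z &= \mathbf{c}_B^T(\mathbf{B}^{-1}\mathbf{b} - \mathbf{B}^{-1}\mathbf{D}\mathbf{x}_D) + \mathbf{c}_D^T\mathbf{x}_D \\
  &= \mathbf{c}_B^T\mathbf{B}^{-1}\mathbf{b} + (\mathbf{c}_D^T - \mathbf{c}_B^T\mathbf{B}^{-1}\mathbf{D})\mathbf{x}_D,
\end{aligned}
\tag{3.30}
$$

which expresses the cost of any solution to (3.28) in terms of $\mathbf{x}_D$. Thus

$$\mathbf{r}_D^T = \mathbf{c}_D^T - \mathbf{c}_B^T\mathbf{B}^{-1}\mathbf{D} \tag{3.31}$$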

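is the relative cost vector (for nonbasic variables). It is the components of this vector
that are used to determine which vector to bring into the basis.

Having derived the vector expression for the relative cost it is now possible to
write the simplex tableau in matrix form. The initial tableau takes the form

$$
\left[\begin{array}{c|c} \mathbf{A} & \mathbf{b} \\ \hline \mathbf{c}^T & 0 \end{array}\right]
=
\left[\begin{array}{cc|c} \mathbf{B} & \mathbf{D} & \mathbf{b} \\ \hline \mathbf{c}_B^T & \mathbf{c}_D^T & 0 \end{array}\right]
\tag{3.32}
$$

which is not in general in canonical form and does not correspond to a point in the
simplex procedure. If the matrix $\mathbf{B}$ is used as a basis, then the corresponding tableau
becomes

$$
\left[\begin{array}{cc|c} \mathbf{I} & \mathbf{B}^{-1}\mathbf{D} & \mathbf{B}^{-1}\mathbf{b} \\ \hline \mathbf{0} & \mathbf{c}_D^T - \mathbf{c}_B^T\mathbf{B}^{-1}\mathbf{D} & -\mathbf{c}_B^T\mathbf{B}^{-1}\mathbf{b} \end{array}\right]
\tag{3.33}
$$

which is the matrix form we desire.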
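
To make (3.33) concrete, the following sketch recomputes the blocks of the canonical tableau for the reduced problem of the preceding example (the columns $x_2, \ldots, x_5$ of the final phase I tableau) with basis $(x_5, x_3)$. It is only an illustration: the helper names `solve`, `column`, and `dot` are assumptions of this sketch, and exact rational arithmetic is used so that the entries can be compared directly with the final phase II tableau. The corner entry comes out as -5 rather than -19 because the tableau of the example started with -14, not 0, in that position.

```python
from fractions import Fraction

def solve(B, rhs):
    """Solve B z = rhs by Gauss-Jordan elimination in exact arithmetic."""
    n = len(B)
    M = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(B, rhs)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        d = M[col][col]
        M[col] = [v / d for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n] for row in M]

def column(A, j):
    return [row[j] for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Reduced problem: the columns are x2, x3, x4, x5.
A = [[0, 1, -1, 1],
     [1, 2, -1, 0]]
b = [3, 2]
c = [2, 3, -1, 1]
basic, nonbasic = [3, 1], [0, 2]     # basis (x5, x3); nonbasic (x2, x4)

B = [[row[j] for j in basic] for row in A]
c_B = [c[j] for j in basic]

x_B = solve(B, b)                                         # B^{-1} b
BinvD_cols = [solve(B, column(A, j)) for j in nonbasic]   # columns of B^{-1} D
y = solve([list(col) for col in zip(*B)], c_B)            # y^T B = c_B^T
r_D = [c[j] - dot(y, column(A, j)) for j in nonbasic]     # relative costs (3.31)
corner = -dot(c_B, x_B)                                   # -c_B^T B^{-1} b

print([str(v) for v in x_B])   # ['2', '1']
print([str(v) for v in r_D])   # ['1', '1']
print(corner)                  # -5
```

The columns of `BinvD_cols` reproduce the $x_2$ and $x_4$ columns of the final phase II tableau, namely $(-1/2, 1/2)$ and $(-1/2, -1/2)$.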


*The  Revised  Simplex  Method  and  LU  Decomposition 

Extensive experience with the simplex procedure applied to problems from various
fields, and having various values of $n$ and $m$, has indicated that the method can be
expected to converge to an optimum solution in about $m$, or perhaps $3m/2$, pivot
operations. (Except in the worst case. See Chap. 5.) Thus, particularly if $m$ is much
smaller than $n$, that is, if the matrix $\mathbf{A}$ has far fewer rows than columns, pivots will
occur in only a small fraction of the columns during the course of optimization.

Since the other columns are not explicitly used, it appears that the work expended
in calculating the elements in these columns after each pivot is, in some sense,
wasted effort. The revised simplex method is a scheme for ordering the computations
required of the simplex method so that unnecessary calculations are avoided.
In fact, even if pivoting is eventually required in all columns, but $m$ is small compared
to $n$, the revised simplex method can frequently save computational effort.

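The revised form of the simplex method is this: Given the inverse $\mathbf{B}^{-1}$ of a current
basis, and the current solution $\mathbf{x}_B = \bar{\mathbf{a}}_0 = \mathbf{B}^{-1}\mathbf{b}$,

Step 1. Calculate the current relative cost coefficients $\mathbf{r}_D^T = \mathbf{c}_D^T - \mathbf{c}_B^T\mathbf{B}^{-1}\mathbf{D}$. This
can best be done by first calculating $\mathbf{y}^T = \mathbf{c}_B^T\mathbf{B}^{-1}$ and then the relative cost vector
$\mathbf{r}_D^T = \mathbf{c}_D^T - \mathbf{y}^T\mathbf{D}$. If $\mathbf{r}_D \ge \mathbf{0}$ stop; the current solution is optimal.

Step 2. Determine which vector $\mathbf{a}_q$ is to enter the basis by selecting the most
negative cost coefficient; and calculate $\bar{\mathbf{a}}_q = \mathbf{B}^{-1}\mathbf{a}_q$, which gives the vector $\mathbf{a}_q$
expressed in terms of the current basis.

Step 3. If no $\bar{a}_{iq} > 0$, stop; the problem is unbounded. Otherwise, calculate the
ratios $\bar{a}_{i0}/\bar{a}_{iq}$ for $\bar{a}_{iq} > 0$ to determine which vector is to leave the basis.

Step 4. Update $\mathbf{B}^{-1}$ and the current solution $\mathbf{B}^{-1}\mathbf{b}$. Return to Step 1.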
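
The four steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production routine: the function name `revised_simplex` is ours, the starting basis is assumed to be feasible with basis columns forming an identity matrix (so that $\mathbf{B}^{-1} = \mathbf{I}$ initially), the entering variable is chosen by the most negative relative cost, and no safeguard against cycling is included. Step 4 is carried out by ordinary pivot operations on the array formed by $\mathbf{B}^{-1}$ and the current solution. The data are those of the reduced phase II problem of the preceding example.

```python
from fractions import Fraction

def revised_simplex(A, b, c, basis):
    """Revised simplex sketch; assumes the starting basis columns form an identity."""
    m, n = len(A), len(c)
    basis = list(basis)
    Binv = [[Fraction(int(i == k)) for k in range(m)] for i in range(m)]
    x_B = [Fraction(v) for v in b]            # B^{-1} b with B = I
    while True:
        # Step 1: y^T = c_B^T B^{-1}, then r_j = c_j - y^T a_j for nonbasic j.
        y = [sum(c[basis[i]] * Binv[i][k] for i in range(m)) for k in range(m)]
        r = {j: c[j] - sum(y[k] * A[k][j] for k in range(m))
             for j in range(n) if j not in basis}
        q = min(r, key=r.get)                 # most negative relative cost
        if r[q] >= 0:
            return basis, x_B
        # Step 2: entering column expressed in terms of the current basis.
        a_q = [sum(Binv[i][k] * A[k][q] for k in range(m)) for i in range(m)]
        # Step 3: ratio test over the rows with positive entries.
        rows = [i for i in range(m) if a_q[i] > 0]
        if not rows:
            raise ValueError("problem is unbounded")
        p = min(rows, key=lambda i: x_B[i] / a_q[i])
        # Step 4: pivot on a_q[p], applied to the array [B^{-1} | x_B].
        pivot = a_q[p]
        Binv[p] = [v / pivot for v in Binv[p]]
        x_B[p] = x_B[p] / pivot
        for i in range(m):
            if i != p:
                f = a_q[i]
                Binv[i] = [vi - f * vp for vi, vp in zip(Binv[i], Binv[p])]
                x_B[i] -= f * x_B[p]
        basis[p] = q

# Reduced problem (columns x2, x3, x4, x5), starting from the basis (x5, x2).
A = [[0, 1, -1, 1],
     [1, 2, -1, 0]]
b = [3, 2]
c = [2, 3, -1, 1]
basis, x_B = revised_simplex(A, b, c, basis=[3, 0])
print(basis, [str(v) for v in x_B])   # [3, 1] ['2', '1']
```

The single pivot brings $x_3$ into the basis in place of $x_2$ and ends at $x_5 = 2$, $x_3 = 1$, as in the tableau computation.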

Updating of $\mathbf{B}^{-1}$ is accomplished by the usual pivot operations applied to an array
consisting of $\mathbf{B}^{-1}$ and $\bar{\mathbf{a}}_q$, where the pivot is the appropriate element in $\bar{\mathbf{a}}_q$. Of course
$\mathbf{B}^{-1}\mathbf{b}$ may be updated at the same time by adjoining it as another column.

One may go one step further in the matrix interpretation of the simplex method
and note that execution of a single simplex cycle is not explicitly dependent on
having $\mathbf{B}^{-1}$ but rather on the ability to solve linear systems with $\mathbf{B}$ as the coefficient
matrix. Instead of the inverse, one can maintain and update a decomposition $\mathbf{B} = \mathbf{L}\mathbf{U}$,
where $\mathbf{L}$ is a lower triangular matrix and $\mathbf{U}$ is an upper triangular matrix. Then each
of the linear systems can be solved by solving two triangular systems.
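
The two triangular solves can be illustrated as follows. The factors `L` and `U` below are a small hypothetical example (their product is the matrix with rows (2, 1) and (4, 5)), and the function names are ours; the sketch only shows how $\mathbf{B}\mathbf{x} = \mathbf{b}$ is solved once a factorization is available, not how the factors are updated from one basis to the next.

```python
from fractions import Fraction

def forward_substitution(L, rhs):
    """Solve L z = rhs for a lower triangular L with nonzero diagonal."""
    z = []
    for i, row in enumerate(L):
        s = sum(row[k] * z[k] for k in range(i))
        z.append(Fraction(rhs[i] - s) / row[i])
    return z

def back_substitution(U, rhs):
    """Solve U x = rhs for an upper triangular U with nonzero diagonal."""
    n = len(U)
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        s = sum(U[i][k] * x[k] for k in range(i + 1, n))
        x[i] = Fraction(rhs[i] - s) / U[i][i]
    return x

# Hypothetical factors: B = L U = [[2, 1], [4, 5]].
L = [[1, 0],
     [2, 1]]
U = [[2, 1],
     [0, 3]]
b = [4, 14]

z = forward_substitution(L, b)    # L z = b
x = back_substitution(U, z)       # U x = z, hence B x = b
print([str(v) for v in x])        # ['1', '2']
```

Systems of the form $\mathbf{y}^T\mathbf{B} = \mathbf{c}_B^T$ can be handled in the same way by applying the two routines to the transposed factors.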


3.7  Simplex  Method  for  Transportation  Problems 


The transportation problem was stated briefly in Chap. 2. We restate it here. There
are $m$ origins that contain various amounts of a commodity that must be shipped to $n$
destinations to meet demand requirements. Specifically, origin $i$ contains an amount
$a_i$ and destination $j$ has a requirement of amount $b_j$. It is assumed that the system
is balanced in the sense that total supply equals total demand. That is,

$$\sum_{i=1}^{m} a_i = \sum_{j=1}^{n} b_j. \tag{3.34}$$

The numbers $a_i$ and $b_j$, $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n$, are assumed to be
nonnegative, and in many applications they are in fact nonnegative integers. There is a
unit cost $c_{ij}$ associated with the shipping of the commodity from origin $i$ to
destination $j$. The problem is to find the shipping pattern between origins and destinations
that satisfies all the requirements and minimizes the total shipping cost.

In mathematical terms the above problem can be expressed as finding a set of
$x_{ij}$'s, $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n$, to

$$
\begin{aligned}
\text{minimize}\quad & \sum_{i=1}^{m}\sum_{j=1}^{n} c_{ij}x_{ij} \\
\text{subject to}\quad & \sum_{j=1}^{n} x_{ij} = a_i \quad \text{for } i = 1, 2, \ldots, m \\
& \sum_{i=1}^{m} x_{ij} = b_j \quad \text{for } j = 1, 2, \ldots, n \\
& x_{ij} \ge 0 \quad \text{for all } i \text{ and } j.
\end{aligned}
\tag{3.35}
$$

This mathematical problem, together with the assumption (3.34), is the general
transportation problem. In the shipping context, the variables $x_{ij}$ represent the
amounts of the commodity shipped from origin $i$ to destination $j$.

The structure of the problem can be seen more clearly by writing the constraint
equations in standard form:

$$
\begin{array}{rcl}
x_{11} + x_{12} + \cdots + x_{1n} & = & a_1 \\
x_{21} + x_{22} + \cdots + x_{2n} & = & a_2 \\
 & \vdots & \\
x_{m1} + x_{m2} + \cdots + x_{mn} & = & a_m \\
x_{11} + x_{21} + \cdots + x_{m1} & = & b_1 \\
x_{12} + x_{22} + \cdots + x_{m2} & = & b_2 \\
 & \vdots & \\
x_{1n} + x_{2n} + \cdots + x_{mn} & = & b_n
\end{array}
\tag{3.36}
$$


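The structure is perhaps even more evident when the coefficient matrix $\mathbf{A}$ of the
system of equations above is expressed in vector-matrix notation as

$$
\mathbf{A} =
\begin{bmatrix}
\mathbf{1}^T & & & \\
 & \mathbf{1}^T & & \\
 & & \ddots & \\
 & & & \mathbf{1}^T \\
\mathbf{I} & \mathbf{I} & \cdots & \mathbf{I}
\end{bmatrix}
\tag{3.37}
$$

where $\mathbf{1} = (1, 1, \ldots, 1)$ is $n$-dimensional, and where each $\mathbf{I}$ is an $n \times n$ identity matrix.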

In practice it is usually unnecessary to write out the constraint equations of the
transportation problem in the explicit form (3.36). A specific transportation problem
is generally defined by simply presenting the data in compact form, such as:

$$
\mathbf{a} = (a_1, a_2, \ldots, a_m), \qquad
\mathbf{b} = (b_1, b_2, \ldots, b_n), \qquad
\mathbf{C} =
\begin{bmatrix}
c_{11} & c_{12} & \cdots & c_{1n} \\
c_{21} & c_{22} & \cdots & c_{2n} \\
\vdots & \vdots & & \vdots \\
c_{m1} & c_{m2} & \cdots & c_{mn}
\end{bmatrix}.
$$

The solution can also be represented by an $m \times n$ array, and as we shall see, all
computations can be made on arrays of a similar dimension.


Example 1. As an example, which will be solved completely in a later section, a
specific transportation problem with four origins and five destinations is defined by

$$
\mathbf{a} = (30, 80, 10, 60), \qquad
\mathbf{C} =
\begin{bmatrix}
3 & 4 & 6 & 8 & 9 \\
2 & 2 & 4 & 5 & 5 \\
2 & 2 & 2 & 3 & 2 \\
3 & 3 & 2 & 4 & 2
\end{bmatrix}, \qquad
\mathbf{b} = (10, 50, 20, 80, 20).
$$

Note that the balance requirement is satisfied, since total supply and total
demand are both 180.

Finding a Basic Feasible Solution

A first step in the study of the structure of the transportation problem is to show
that there is always a feasible solution, thus establishing that the problem is well
defined. A feasible solution can be found by allocating shipments from origins to
destinations in proportion to supply and demand requirements. Specifically, let $S$
be equal to the total supply (which is also equal to the total demand). Then let
$x_{ij} = a_i b_j / S$ for $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n$. The reader can easily verify that
this is a feasible solution. We also note that the solutions are bounded, since each
$x_{ij}$ is bounded by $a_i$ (and by $b_j$). A bounded program with a feasible solution has an
optimal solution. Thus, a transportation problem always has an optimal solution.
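
The verification suggested above is easily carried out numerically. The sketch below builds the proportional allocation $x_{ij} = a_i b_j / S$ for the data of Example 1 and checks the row and column sums; exact fractions are used so that the comparison is not affected by rounding. The variable names are ours.

```python
from fractions import Fraction

a = [30, 80, 10, 60]            # supplies of Example 1
b = [10, 50, 20, 80, 20]        # demands of Example 1
S = sum(a)                      # total supply, equal to total demand

# Proportional allocation x_ij = a_i * b_j / S.
x = [[Fraction(ai * bj, S) for bj in b] for ai in a]

row_sums = [sum(row) for row in x]
col_sums = [sum(col) for col in zip(*x)]

print(row_sums == a, col_sums == b)   # True True
```

Every entry of this allocation is positive, so it uses all $mn = 20$ cells.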

A second step in the study of the structure of the transportation problem is based
on a simple examination of the constraint equations. Clearly there are $m$ equations
corresponding to origin constraints and $n$ equations corresponding to destination
constraints, a total of $n + m$. However, it is easily noted that the sum of the origin
equations is

$$\sum_{i=1}^{m}\sum_{j=1}^{n} x_{ij} = \sum_{i=1}^{m} a_i \tag{3.38}$$

and the sum of the destination equations is

$$\sum_{j=1}^{n}\sum_{i=1}^{m} x_{ij} = \sum_{j=1}^{n} b_j. \tag{3.39}$$


The left-hand sides of these equations are equal. Since they were formed by two
distinct linear combinations of the original equations, it follows that the equations in the
original system are not independent. The right-hand sides of (3.38) and (3.39) are
equal by the assumption that the system is balanced, and therefore the two equations
are, in fact, consistent. However, it is clear that the original system of equations is
redundant. This means that one of the constraints can be eliminated without changing
the set of feasible solutions. Indeed, any one of the constraints can be chosen
as the one to be eliminated, for it can be reconstructed from those remaining. It
follows that a basis for the transportation problem consists of $m + n - 1$ vectors, and
a nondegenerate basic feasible solution consists of $m + n - 1$ variables. The simple
solution found earlier in this section is clearly not a basic solution, since in general
all $mn$ of its components are positive.
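
The redundancy can also be observed numerically. The sketch below assembles the coefficient matrix of (3.36), with the variables ordered $x_{11}, x_{12}, \ldots, x_{mn}$, and computes its rank by exact Gaussian elimination. The function names `transportation_matrix` and `rank` are ours, and the check of a few small cases is an illustration, not a proof.

```python
from fractions import Fraction

def transportation_matrix(m, n):
    """Coefficient matrix of (3.36): m origin rows followed by n destination rows."""
    rows = []
    for i in range(m):
        rows.append([1 if k // n == i else 0 for k in range(m * n)])
    for j in range(n):
        rows.append([1 if k % n == j else 0 for k in range(m * n)])
    return rows

def rank(M):
    """Rank by Gauss-Jordan elimination in exact arithmetic."""
    M = [[Fraction(v) for v in row] for row in M]
    r = 0
    for col in range(len(M[0])):
        piv = next((i for i in range(r, len(M)) if M[i][col] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(len(M)):
            if i != r and M[i][col] != 0:
                f = M[i][col] / M[r][col]
                M[i] = [vi - f * vr for vi, vr in zip(M[i], M[r])]
        r += 1
        if r == len(M):
            break
    return r

A = transportation_matrix(4, 5)
print(len(A), rank(A))   # 9 8
```

For four origins and five destinations there are nine equations but only $m + n - 1 = 8$ independent ones.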

There is a straightforward way to compute an initial basic feasible solution to
a transportation problem. The method is worth studying at this stage because it
introduces the computational process that is the foundation for the general solution
technique based on the simplex method. It also begins to illustrate the fundamental
property of the structure of transportation problems.

The Northwest Corner Rule

This procedure is conducted on the solution array shown below:

$$
\begin{array}{ccccc|c}
x_{11} & x_{12} & x_{13} & \cdots & x_{1n} & a_1 \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2n} & a_2 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
x_{m1} & x_{m2} & x_{m3} & \cdots & x_{mn} & a_m \\
\hline
b_1 & b_2 & b_3 & \cdots & b_n &
\end{array}
\tag{3.40}
$$

The  individual  elements  of  the  array  appear  in  cells  and  represent  a  solution.  An 
empty  cell  denotes  a  value  of  zero. 

Beginning with all empty cells, the procedure is given by the following steps:

Step 1. Start with the cell in the upper left-hand corner.

Step 2. Allocate the maximum feasible amount consistent with row and column
sum requirements involving that cell. (At least one of these requirements will
then be met.)

Step 3. Move one cell to the right if there is any remaining row requirement (supply).
Otherwise move one cell down. If all requirements are met, stop; otherwise
go to Step 2.

The  procedure  is  called  the  Northwest  Corner  Rule  because  at  each  step  it  selects 
the  cell  in  the  upper  left-hand  corner  of  the  subarray  consisting  of  current  nonzero 
row  and  column  requirements. 


Example 1. A basic feasible solution constructed by the Northwest Corner Rule is
shown below for Example 1 of the last section.

    10   20                  | 30
         30   20   30        | 80
                   10        | 10        (3.41)
                   40   20   | 60
    10   50   20   80   20

In  the  first  step,  at  the  upper  left-hand  corner,  a  maximum  of  10  units  could  be 
allocated,  since  that  is  all  that  was  required  by  column  1.  This  left  30  -  10  =  20 
units  required  in  the  first  row.  Next,  moving  to  the  second  cell  in  the  top  row,  the 
remaining  20  units  were  allocated.  At  this  point  the  row  1  requirement  is  met,  and 
it  is  necessary  to  move  down  to  the  second  row.  The  reader  should  be  able  to  follow 
the  remaining  steps  easily. 
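
The rule is short enough to be written out directly. The following Python sketch is one possible rendering of Steps 1 to 3 under the assumption that the problem is balanced; the function name `northwest_corner` and the use of `None` for empty cells are our own conventions. When a row and a column requirement are met simultaneously, this version moves down, which places the degenerate zero in the cell below.

```python
def northwest_corner(a, b):
    """Return an m x n array; None marks an empty (nonbasic) cell."""
    m, n = len(a), len(b)
    supply, demand = list(a), list(b)
    x = [[None] * n for _ in range(m)]
    i = j = 0                                   # Step 1: upper left-hand corner
    while True:
        amount = min(supply[i], demand[j])      # Step 2: largest feasible amount
        x[i][j] = amount
        supply[i] -= amount
        demand[j] -= amount
        if i == m - 1 and j == n - 1:
            return x
        # Step 3: move right while supply remains in the row, otherwise down.
        if j < n - 1 and (supply[i] > 0 or i == m - 1):
            j += 1
        else:
            i += 1

x = northwest_corner([30, 80, 10, 60], [10, 50, 20, 80, 20])
for row in x:
    print(row)
# [10, 20, None, None, None]
# [None, 30, 20, 30, None]
# [None, None, None, 10, None]
# [None, None, None, 40, 20]
```

The path from the upper left to the lower right corner always visits exactly $m + n - 1$ cells, which is the size of a basis.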

There is the possibility that at some point both the row and column requirements
corresponding to a cell may be met. The next entry will then be a zero, indicating a
degenerate basic solution. In such a case there is a choice as to where to place the
zero. One can either move right or move down to enter the zero. Two examples of
degenerate solutions to a problem are shown below:

    30                  | 30          30                  | 30
    20   20             | 40          20   20    0        | 40
          0   20        | 20                    20        | 20
              20   40   | 60                    20   40   | 60
    50   20   40   40                 50   20   40   40

It  should  be  clear  that  the  Northwest  Corner  Rule  can  be  used  to  obtain  different 
basic  feasible  solutions  by  first  permuting  the  rows  and  columns  of  the  array  before 
the  procedure  is  applied.  Or  equivalently,  one  can  do  this  indirectly  by  starting  the 
procedure  at  an  arbitrary  cell  and  then  considering  successive  rows  and  columns  in 
an  arbitrary  order. 


Basis  Triangularity 

We now establish the most important structural property of the transportation
problem: the triangularity of all bases. This property simplifies the process of solution
of a system of equations whose coefficient matrix corresponds to a basis, and thus
leads to efficient implementation of the simplex method.

The concept of upper and lower triangular matrices was introduced in connection
with Gaussian elimination methods (see Appendix C). It is useful at this point to
generalize slightly the notion of upper and lower triangularity.

Definition.  A  nonsingular  square  matrix  M  is  said  to  be  triangular  if  by  a  permutation  of 
its  rows  and  columns  it  can  be  put  in  the  form  of  a  lower  triangular  matrix. 

There is a simple and useful procedure for determining whether a given matrix
$\mathbf{M}$ is triangular:

Step 1. Find a row with exactly one nonzero entry.

Step 2. Form a submatrix of the matrix used in Step 1 by crossing out the row
found in Step 1 and the column corresponding to the nonzero entry in that row.
Return to Step 1 with this submatrix.

If  this  procedure  can  be  continued  until  all  rows  have  been  eliminated,  then  the 
matrix  is  triangular.  It  can  be  put  in  lower  triangular  form  explicitly  by  arranging 
the  rows  and  columns  in  the  order  that  was  determined  by  the  procedure. 
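
The procedure translates almost literally into code. In the sketch below, the function name `triangular_order` and the 3 x 3 test matrix are hypothetical and serve only as an illustration; the function returns the order in which rows and columns are crossed out, or `None` if at some stage no row with exactly one nonzero entry remains.

```python
def triangular_order(M):
    """Return (row_order, col_order) if M is triangular in the above sense, else None."""
    n = len(M)
    rows, cols = set(range(n)), set(range(n))
    row_order, col_order = [], []
    while rows:
        # Step 1: look for a remaining row with exactly one remaining nonzero entry.
        for i in sorted(rows):
            nonzero = [j for j in cols if M[i][j] != 0]
            if len(nonzero) == 1:
                break
        else:
            return None
        # Step 2: cross out that row and the column of its nonzero entry.
        j = nonzero[0]
        rows.remove(i)
        cols.remove(j)
        row_order.append(i)
        col_order.append(j)
    return row_order, col_order

# Hypothetical example matrix.
M = [[1, 2, 0],
     [0, 3, 0],
     [4, 0, 5]]
row_order, col_order = triangular_order(M)
permuted = [[M[i][j] for j in col_order] for i in row_order]
print(row_order, col_order)   # [1, 0, 2] [1, 0, 2]
print(permuted)               # [[3, 0, 0], [2, 1, 0], [0, 4, 5]]
```

Arranging the rows and columns in the order found puts the matrix in lower triangular form, as the permuted array shows.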

Example 1. Shown below on the left is a matrix before the above procedure is applied
to it. Indicated along the edges of this matrix is the order in which the rows
and columns are indexed according to the procedure. Shown at the right is the same
matrix when its rows and columns are permuted according to the order found.

Triangularization

We are now prepared to derive the most important structural property of the
transportation problem.

Basis  Triangularity  Theorem.  Every  basis  of  the  transportation  problem  is  triangular. 

Proof. Refer to the system of constraints (3.36). Let us change the sign of the top
half of the system; then the coefficient matrix of the system consists of entries that
are either +1, -1, or 0. Following the result of the theorem in Sect. 3.7, delete any
one of the equations to eliminate the redundancy. From the resulting coefficient
matrix, form a basis $\mathbf{B}$ by selecting a nonsingular subset of $m + n - 1$ columns.

Each column of $\mathbf{B}$ contains at most two nonzero entries, a +1 and a -1. Thus
there are at most $2(m + n - 1)$ nonzero entries in the basis. However, if every column
contained two nonzero entries, then the sum of all rows would be zero, contradicting
the nonsingularity of $\mathbf{B}$. Thus at least one column of $\mathbf{B}$ must contain only one
nonzero entry. This means that the total number of nonzero entries in $\mathbf{B}$ is less than
$2(m + n - 1)$. It then follows that there must be a row with only one nonzero entry;
for if every row had two or more nonzero entries, the total number would be at least
$2(m + n - 1)$. This means that the first step of the procedure for verifying triangularity
is satisfied. A similar argument can be applied to the submatrix of $\mathbf{B}$ obtained by
crossing out the row with the single nonzero entry and the column corresponding to
that entry; that submatrix must also contain a row with a single nonzero entry. This
argument can be continued, establishing that the basis $\mathbf{B}$ is triangular. ∎

Example 2. As an illustration of the Basis Triangularity Theorem, consider the basis
selected by the Northwest Corner Rule in Example 1. This basis is represented
below, except that only the basic variables are indicated, not their values.

     x    x                  | 30
          x    x    x        | 80
                    x        | 10
                    x    x   | 60
    10   50   20   80   20

A row in a basis matrix corresponds to an equation in the original system and is
associated with a constraint either on a row or column sum in the solution array. In
this example the equation corresponding to the first column sum contains only one
basis variable, $x_{11}$. The value of this variable can be found immediately to be 10.
The next equation corresponds to the first row sum. The corresponding variable is
$x_{12}$, which can be found to be 20, since $x_{11}$ is known. Progression in this manner
through the basis variables is equivalent to back substitution.

The importance of triangularity is, of course, the associated method of back
substitution for the solution of a triangular system of equations, as discussed in
Appendix C. Moreover, since any basis matrix is triangular and all nonzero elements
are equal to one (or minus one if the signs of some equations are changed), it
follows that the process of back substitution will simply involve repeated additions
and subtractions of the given row and column sums. No multiplication is required. It
therefore follows that if the original row and column totals are integers, the values of
all basic variables will be integers. This is an important result, which we summarize
by a corollary to the Basis Triangularity Theorem.

Corollary.  If  the  row  and  column  sums  of  a  transportation  problem  are  integers,  then  the 
basic  variables  in  any  basic  solution  are  integers. 


The  Transportation  Simplex  Method 

Now that the structural properties of the transportation problem have been developed,
it is a relatively straightforward task to work out the details of the simplex
method for the transportation problem. A major objective is to exploit fully the
triangularity property of bases in order to achieve both computational efficiency and
a compact representation of the method. The method used is actually a direct adaptation
of the version of the revised simplex method presented in the first part of
Sect. 3.6. The basis is never inverted; instead, its triangular form is used directly to
solve for all required variables.


Simplex  Multipliers 

Simplex multipliers are associated with the constraint equations. In this case we
partition the vector of multipliers as $\mathbf{y} = (\mathbf{u}, \mathbf{v})$. Here, $u_i$ represents the multiplier
associated with the $i$th row sum constraint, and $v_j$ represents the multiplier associated
with the $j$th column sum constraint. Since one of the constraints is redundant,
an arbitrary value may be assigned to any one of the multipliers (see Exercise 4,
Chap. 4). For notational simplicity we shall at this point set $v_n = 0$.

Given a basis $\mathbf{B}$, the simplex multipliers are found to be the solution to the
equation $\mathbf{y}^T\mathbf{B} = \mathbf{c}_B^T$. To determine the explicit form of these equations, we again
refer to the original system of constraints (3.36). If $x_{ij}$ is basic, then the corresponding
column from $\mathbf{A}$ will be included in $\mathbf{B}$. This column has exactly two +1 entries:
one in the $i$th position of the top portion and one in the $j$th position of the bottom
portion. This column thus generates the simplex multiplier equation $u_i + v_j = c_{ij}$,
since $u_i$ and $v_j$ are the corresponding components of the multiplier vector. Overall,
the simplex multiplier equations are

$$u_i + v_j = c_{ij}, \tag{3.42}$$

for all $i, j$ for which $x_{ij}$ is basic. The coefficient matrix of this system is the transpose
of the basis matrix and hence it is triangular. Thus, this system can be solved by back
substitution. This is similar to the procedure for finding the values of basic variables
and, accordingly, as another corollary of the Basis Triangularity Theorem, an integer
property holds for simplex multipliers.

Corollary. If the unit costs $c_{ij}$ of a transportation problem are all integers, then (assuming
one simplex multiplier is set arbitrarily equal to an integer) the simplex multipliers associated
with any basis are integers.

Once the simplex multipliers are known, the relative cost coefficients for nonbasic
variables can be found in the usual manner as $\mathbf{r}_D^T = \mathbf{c}_D^T - \mathbf{y}^T\mathbf{D}$. In this case the
relative cost coefficients are

$$r_{ij} = c_{ij} - u_i - v_j \quad \text{for } i = 1, 2, \ldots, m;\; j = 1, 2, \ldots, n. \tag{3.43}$$

This relation is valid for basic variables as well if we define relative cost coefficients
for them, having value zero.

Given a basis, computation of the simplex multipliers is quite similar to the
calculation of the values of the basic variables. The calculation is easily carried out
on an array of the form shown below, where the circled elements correspond to the
positions of the basic variables in the current basis.

In this case the main part of the array, with the coefficients $c_{ij}$, remains fixed, and
we calculate the extra column and row corresponding to $\mathbf{u}$ and $\mathbf{v}$.

The procedure for calculating the simplex multipliers is this:

Step 1. Assign an arbitrary value to any one of the multipliers.

Step 2. Scan the rows and columns of the array until a circled element $c_{ij}$ is found
such that either $u_i$ or $v_j$ (but not both) has already been determined.

Step 3. Compute the undetermined $u_i$ or $v_j$ from the equation $c_{ij} = u_i + v_j$. If all
multipliers are determined, stop. Otherwise, return to Step 2.
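
The scanning procedure can be sketched as follows. The function name `simplex_multipliers`, the zero-based cell indices, and the choice $v_n = 0$ in Step 1 are conventions of this sketch; the data are the cost array of the four-origin, five-destination example together with the basis produced by the Northwest Corner Rule. The relative cost coefficients (3.43) are computed at the end.

```python
def simplex_multipliers(C, basis):
    """Solve u_i + v_j = c_ij over the basic cells, starting from v_n = 0."""
    m, n = len(C), len(C[0])
    u, v = [None] * m, [None] * n
    v[n - 1] = 0                           # Step 1: arbitrary normalization
    pending = set(basis)
    while pending:
        progress = False
        for (i, j) in sorted(pending):     # Step 2: scan the basic cells
            if u[i] is None and v[j] is None:
                continue                   # neither multiplier is known yet
            if u[i] is None:               # Step 3: determine the missing one
                u[i] = C[i][j] - v[j]
            elif v[j] is None:
                v[j] = C[i][j] - u[i]
            pending.remove((i, j))
            progress = True
        if not progress:
            raise ValueError("the given cells do not form a basis")
    return u, v

C = [[3, 4, 6, 8, 9],
     [2, 2, 4, 5, 5],
     [2, 2, 2, 3, 2],
     [3, 3, 2, 4, 2]]
basis = [(0, 0), (0, 1), (1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (3, 4)]

u, v = simplex_multipliers(C, basis)
r = [[C[i][j] - u[i] - v[j] for j in range(5)] for i in range(4)]
print(u, v)      # [5, 3, 1, 2] [-2, -1, 1, 2, 0]
print(r[3][2])   # -1
```

For this basis the only negative relative cost coefficient is $r_{43} = -1$, so $x_{43}$ is the candidate for entry into the basis.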

The triangularity of the basis guarantees that this procedure can be carried
through to determine all the simplex multipliers.

Example 1. Consider the cost array of Example 1 at the beginning of Sect. 3.7, which is
shown below with the circled elements corresponding to a basic feasible solution (found by
the Northwest Corner Rule). Only these numbers are used in the calculation of the
multipliers.

     ③   ④   6   8   9
     2   ②   ④   ⑤   5
     2   2   2   ③   2
     3   3   2   ④   ②

We first arbitrarily set $v_5 = 0$. We then scan the cells, searching for a circled element
for which only one multiplier must be determined. This is the bottom right corner
element, and it gives $u_4 = 2$. Then, from the equation $4 = 2 + v_4$, $v_4$ is found to be 2.
Next, $u_3$ and $u_2$ are determined, then $v_3$ and $v_2$, and finally $u_1$ and $v_1$. The result is
shown below:

                              u
     ③   ④   6   8   9   |   5
     2   ②   ④   ⑤   5   |   3
     2   2   2   ③   2   |   1
     3   3   2   ④   ②   |   2
v   -2  -1   1   2   0

Cycle of Change

In  accordance  with  the  general  simplex  procedure,  if  a  nonbasic  variable  has  an 
associated  relative  cost  coefficient  that  is  negative,  then  that  variable  is  a  candidate 
for  entry  into  the  basis.  As  the  value  of  this  variable  is  gradually  increased,  the 
values  of  the  current  basic  variables  will  change  continuously  in  order  to  maintain 
feasibility.  Then,  as  usual,  the  value  of  the  new  variable  is  increased  precisely  to  the 
point  where  one  of  the  old  basic  variables  is  driven  to  zero. 

We must work out the details of how the values of the current basic variables
change as a new variable is entered. If the new basic vector is $\mathbf{d}$, then the change
in the other variables is given by $-\mathbf{B}^{-1}\mathbf{d}$, where $\mathbf{B}$ is the current basis. Hence, once
again we are faced with a problem of solving a system associated with the triangular
basis, and once again the solution has special properties. In the next theorem recall
that $\mathbf{A}$ is defined by (3.37).

Theorem. Let $\mathbf{B}$ be a basis from $\mathbf{A}$ (ignoring one row), and let $\mathbf{d}$ be another column. Then
the components of the vector $\mathbf{w} = \mathbf{B}^{-1}\mathbf{d}$ are either 0, +1, or -1.

Proof. Let $\mathbf{w}$ be the solution to the equation $\mathbf{B}\mathbf{w} = \mathbf{d}$. Then $\mathbf{w}$ is the representation
of $\mathbf{d}$ in terms of the basis. This equation can be solved by Cramer's rule as

$$w_k = \frac{\det \mathbf{B}_k}{\det \mathbf{B}},$$

where $\mathbf{B}_k$ is the matrix obtained by replacing the $k$th column of $\mathbf{B}$ by $\mathbf{d}$. Both $\mathbf{B}$ and
$\mathbf{B}_k$ are submatrices of the original constraint matrix $\mathbf{A}$. The matrix $\mathbf{B}$ may be put
in triangular form with all diagonal elements equal to +1. Hence, accounting for
the sign change that may result from the combined row and column interchanges,
$\det \mathbf{B} = +1$ or $-1$. Likewise, it can be shown (see Exercise 3) that $\det \mathbf{B}_k = 0$, +1,
or -1. We conclude that each component of $\mathbf{w}$ is either 0, +1, or -1. ∎

The implication of the above result is that when a new variable is added to the
solution at a unit level, the current basic variables will each change by +1, -1, or 0.
If the new variable has a value $\theta$, then, correspondingly, the basic variables change
by $+\theta$, $-\theta$, or 0. It is therefore only necessary to determine the signs of change for
each basic variable.

The determination of these signs is again accomplished by row and column scanning.
Operationally, one assigns a + to the cell of the entering variable to represent
a change of $+\theta$, where $\theta$ is yet to be determined. Then +'s, -'s, and 0's are assigned,
one by one, to the cells of some basic variables, indicating changes of $+\theta$, $-\theta$, or
0 to maintain a solution. As usual, after each step there will always be an equation
that uniquely determines the sign to be assigned to another basic variable. The result
will be a sequence of pluses and minuses assigned to cells that form a cycle leading
from the cell of the entering variable back to that cell. In essence, the new change is
part of a cycle of redistribution of the commodity flow in the transportation system.

Once the sequence of +'s, -'s, and 0's is determined, the new basic feasible
solution is found by setting the level of the change $\theta$. This is set so as to drive one
of the old basic variables to zero. One must simply examine those basic variables
for which a minus sign has been assigned, for these are the ones that will decrease
as the new variable is introduced. Then $\theta$ is set equal to the smallest magnitude of
these variables. This value is added to all cells that have a + assigned to them and
subtracted from all cells that have a - assigned. The result will be the new basic
feasible solution.

The  procedure  is  illustrated  by  the  following  example. 

Example 2. A completed solution array is shown below (the last column gives the row sums and the last row gives the column sums):

                      10⁰                  |  10
                      20⁻           10⁺    |  30
        20⁺    10⁰                  30⁻    |  60
        10⁰                                |  10
        10⁻            +     40⁰           |  50
        -----------------------------------
        40     10     30     40     40

In this example $x_{53}$ is the entering variable, so a plus sign is assigned there. The signs of the other cells were determined in the order $x_{13}$, $x_{23}$, $x_{25}$, $x_{35}$, $x_{32}$, $x_{31}$, $x_{41}$, $x_{51}$, $x_{54}$. The smallest variable with a minus assigned to it is $x_{51} = 10$. Thus we set

$\theta = 10$.
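The update in Example 2 can be checked with a few lines of Python. This is only a sketch: the function name `apply_cycle` and the dictionary layout are illustrative assumptions, and the cell values are those read from the example array, with 1-based (row, column) indices.

```python
def apply_cycle(x, signs):
    """Shift flow around a cycle of change.

    x     -- dict {(row, col): value} for the basic cells and the entering cell
    signs -- dict {(row, col): +1 or -1} for cells on the cycle (0-sign cells omitted)
    Returns (theta, new_x, leaving_cell).
    """
    # Only cells marked with a minus decrease, so they bound the change theta.
    minus_cells = [cell for cell, s in signs.items() if s < 0]
    leaving = min(minus_cells, key=lambda cell: x[cell])
    theta = x[leaving]
    new_x = dict(x)
    for cell, s in signs.items():
        new_x[cell] = x[cell] + s * theta
    return theta, new_x, leaving


# Cell values as read from the array of Example 2.
x = {(1, 3): 10, (2, 3): 20, (2, 5): 10, (3, 1): 20, (3, 2): 10,
     (3, 5): 30, (4, 1): 10, (5, 1): 10, (5, 4): 40,
     (5, 3): 0}  # the entering variable starts at zero
signs = {(5, 3): +1, (2, 3): -1, (2, 5): +1,
         (3, 5): -1, (3, 1): +1, (5, 1): -1}

theta, new_x, leaving = apply_cycle(x, signs)
print(theta, leaving)  # 10 (5, 1)
```

Row and column sums are unchanged by the update because every row and column touched by the cycle contains exactly one plus and one minus.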




The  Transportation  Simplex  Algorithm 

It  is  now  possible  to  put  together  the  components  developed  to  this  point  in  the  form 
of  a  complete  revised  simplex  procedure  for  the  transportation  problem.  The  steps 
are: 

Step  1.  Compute  an  initial  basic  feasible  solution  using  the  Northwest  Corner 
Rule  or  some  other  method. 

Step  2.  Compute  the  simplex  multipliers  and  the  relative  cost  coefficients.  If  all 
relative  cost  coefficients  are  nonnegative,  stop;  the  solution  is  optimal.  Otherwise, 
go  to  Step  3. 

Step 3. Select a nonbasic variable corresponding to a negative cost coefficient to enter the basis (usually the one corresponding to the most negative cost coefficient). Compute the cycle of change and set $\theta$ equal to the smallest basic variable with a minus assigned to it. Update the solution. Go to Step 2.
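Steps 1 and 2 can be sketched in Python as follows. The function names, the 0-based indexing, the choice of setting the last column multiplier to zero, and the small 2 × 3 problem are all illustrative assumptions rather than data from the text.

```python
def northwest_corner(supply, demand):
    """Step 1: initial basic feasible solution as {(i, j): value}."""
    a, b = list(supply), list(demand)
    i = j = 0
    basis = {}
    while i < len(a) and j < len(b):
        t = min(a[i], b[j])
        basis[(i, j)] = t
        a[i] -= t
        b[j] -= t
        # Move down when the row is exhausted, otherwise move right.
        if a[i] == 0 and i < len(a) - 1:
            i += 1
        else:
            j += 1
    return basis


def simplex_multipliers(cost, basis, m, n):
    """Step 2a: solve u[i] + v[j] = cost[i][j] on basic cells by scanning."""
    u, v = [None] * m, [None] * n
    v[n - 1] = 0  # one multiplier is arbitrary
    remaining = set(basis)
    while remaining:
        progressed = False
        for (i, j) in list(remaining):
            if u[i] is None and v[j] is not None:
                u[i] = cost[i][j] - v[j]
            elif v[j] is None and u[i] is not None:
                v[j] = cost[i][j] - u[i]
            elif u[i] is None:  # neither multiplier known yet
                continue
            remaining.discard((i, j))
            progressed = True
        if not progressed:
            raise ValueError("cells do not form a basis")
    return u, v


def relative_costs(cost, basis, u, v):
    """Step 2b: r_ij = c_ij - u_i - v_j for the nonbasic cells."""
    return {(i, j): cost[i][j] - u[i] - v[j]
            for i in range(len(u)) for j in range(len(v))
            if (i, j) not in basis}


# Hypothetical data for illustration only.
supply, demand = [30, 50], [20, 40, 20]
cost = [[3, 5, 7], [4, 2, 6]]

basis = northwest_corner(supply, demand)
u, v = simplex_multipliers(cost, basis, len(supply), len(demand))
r = relative_costs(cost, basis, u, v)
print(basis)  # {(0, 0): 20, (0, 1): 10, (1, 1): 30, (1, 2): 20}
print(u, v)   # [9, 6] [-6, -4, 0]
print(r)      # {(0, 2): -2, (1, 0): 4}
```

Here cell (0, 2) has a negative relative cost, so under Step 3 it would be chosen to enter the basis.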

Example 3. We can now completely solve the problem that was introduced in Example 1 of the first section. The requirements and a first basic feasible solution obtained by the Northwest Corner Rule are shown below. The plus and minus signs indicated on the array should be ignored at this point, since they cannot be computed until the next step is completed.

The cost coefficients of the problem are shown in the array below, with the circled cells corresponding to the current basic variables. The simplex multipliers, computed by row and column scanning, are shown as well.

The relative cost coefficients are found by subtracting $u_i + v_j$ from $c_{ij}$. In this case the only negative result is in cell 4,3; so variable $x_{43}$ will be brought into the basis. Thus a + is entered into this cell in the original array, and the cycle of zeros and plus and minus signs is determined as shown in that array. (It is not necessary to continue scanning once a complete cycle is determined.)

The smallest basic variable with a minus sign is 20 and, accordingly, 20 is added to or subtracted from elements of the cycle as indicated by the signs. This leads to the new basic feasible solution shown in the array below:

The new simplex multipliers corresponding to the new basis are computed, and the cost array is revised as shown below. In this case all relative cost coefficients are positive, indicating that the current solution is optimal.

Degeneracy

As in all linear programming problems, degeneracy, corresponding to a basic variable having the value zero, can occur in the transportation problem. If degeneracy is encountered in the simplex procedure, it can be handled quite easily by introduction of the standard perturbation method (see Exercise 15, Chap. 3). In this method a zero-valued basic variable is assigned the value $\varepsilon$ and is then treated in the usual way. If it later leaves the basis, then the $\varepsilon$ can be dropped.

Example 4. To illustrate the method of dealing with degeneracy, consider a modification of Example 3, with the fourth row sum changed from 60 to 20 and the fourth column sum changed from 80 to 40. Then the initial basic feasible solution found by the Northwest Corner Rule is degenerate. An $\varepsilon$ is placed in the array for the zero-valued basic variable as shown below:

The relative cost coefficients will be the same as in Example 3, and hence again $x_{43}$ should be chosen to enter, and the cycle of change is the same as before. In this case, however, the change is only $\varepsilon$, and variable $x_{44}$ leaves the basis. The new relative cost coefficients are all positive, indicating that the new solution is optimal. Now the $\varepsilon$ can be dropped to yield the final solution (which is, itself, degenerate in this case).

*3.8 Decomposition

Large linear programming problems usually have some special structural form that can (and should) be exploited to develop efficient computational procedures. One common structure is where there are a number of separate activity areas that are linked through common resource constraints. An example is provided by a multidivisional firm attempting to minimize the total cost of its operations. The divisions of the firm must each meet internal requirements that do not interact with the constraints of other divisions; but in addition there are common resources that must be shared among divisions and thereby represent linking constraints.

A problem of this form can be solved by the Dantzig-Wolfe decomposition method described in this section. The method is an iterative process where at each step a number of separate subproblems are solved. The subproblems are themselves linear programs within the separate areas (or within divisions in the example of the firm). The objective functions of these subproblems are varied from iteration to iteration and are determined by a separate calculation based on the results of the previous iteration. This action coordinates the individual subproblems so that, ultimately, the overall problem is solved. The method can be derived as a special version of the revised simplex method, where the subproblems correspond to evaluation of reduced cost coefficients for the main problem.

To describe the method we consider the linear program in standard form

$$\begin{aligned}
\text{minimize}\quad & c^T x \\
\text{subject to}\quad & Ax = b,\ x \geq 0.
\end{aligned} \tag{3.44}$$


Suppose, for purposes of this entire section, that the A matrix has the special "block-angular" structure:

$$A = \begin{bmatrix}
L_1 & L_2 & \cdots & L_N \\
A_1 & & & \\
& A_2 & & \\
& & \ddots & \\
& & & A_N
\end{bmatrix} \tag{3.45}$$

By partitioning the vectors $x$, $c^T$, and $b$ consistent with this partition of A, the problem can be rewritten as

$$\begin{aligned}
\text{minimize}\quad & \sum_{i=1}^{N} c_i^T x_i \\
\text{subject to}\quad & \sum_{i=1}^{N} L_i x_i = b_0 \\
& A_i x_i = b_i, \quad x_i \geq 0, \quad i = 1, \ldots, N.
\end{aligned} \tag{3.46}$$

This  may  be  viewed  as  a  problem  of  minimizing  the  total  cost  of  N  different  linear 
programs  that  are  independent  except  for  the  first  constraint,  which  is  a  linking 
constraint  of,  say,  dimension  m. 

Each of the subproblems is of the form

$$\begin{aligned}
\text{minimize}\quad & c_i^T x_i \\
\text{subject to}\quad & A_i x_i = b_i,\ x_i \geq 0.
\end{aligned}$$


The constraint set for the ith subproblem is $S_i = \{x_i : A_i x_i = b_i,\ x_i \geq 0\}$. As for any linear program, this constraint set $S_i$ is a polytope and can be expressed as the intersection of a finite number of closed half-spaces. There is no guarantee that each $S_i$ is bounded, even if the original linear program (3.44) has a bounded constraint set. We shall assume for simplicity, however, that each of the polytopes $S_i$, $i = 1, \ldots, N$, is indeed bounded and hence is a polyhedron. One may guarantee that this assumption is satisfied by placing artificial (large) upper bounds on each $x_i$.

Under the boundedness assumption, each polyhedron $S_i$ consists entirely of points that are convex combinations of its extreme points. Thus, if the extreme points of $S_i$ are $\{x_{i1}, x_{i2}, \ldots, x_{iK_i}\}$, then any point $x_i \in S_i$ can be expressed in the form

$$x_i = \sum_{j=1}^{K_i} \alpha_{ij} x_{ij}, \quad \text{where } \sum_{j=1}^{K_i} \alpha_{ij} = 1 \text{ and } \alpha_{ij} \geq 0,\ j = 1, \ldots, K_i. \tag{3.47}$$

The $\alpha_{ij}$'s are the weighting coefficients of the extreme points.

We now convert the original linear program to an equivalent master problem, of which the objective is to find the optimal weighting coefficients for each polyhedron, $S_i$. Corresponding to each extreme point $x_{ij}$ in $S_i$, define $p_{ij} = c_i^T x_{ij}$ and $q_{ij} = L_i x_{ij}$. Clearly $p_{ij}$ is the equivalent cost of the extreme point $x_{ij}$, and $q_{ij}$ is its equivalent activity vector in the linking constraints.

Then the original linear program (3.44) is equivalent, using (3.47), to the master problem:

$$\begin{aligned}
\text{minimize}\quad & \sum_{i=1}^{N}\sum_{j=1}^{K_i} p_{ij}\alpha_{ij} \\
\text{subject to}\quad & \sum_{i=1}^{N}\sum_{j=1}^{K_i} q_{ij}\alpha_{ij} = b_0 \\
& \sum_{j=1}^{K_i} \alpha_{ij} = 1, \quad i = 1, \ldots, N \\
& \alpha_{ij} \geq 0, \quad j = 1, \ldots, K_i, \quad i = 1, \ldots, N.
\end{aligned} \tag{3.48}$$
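The equivalence can be checked term by term. As a short sketch that uses only the definitions of $p_{ij}$ and $q_{ij}$ and the representation (3.47), the cost and the linking-constraint contribution of block i become:

```latex
\begin{aligned}
c_i^T x_i &= c_i^T \sum_{j=1}^{K_i} \alpha_{ij} x_{ij}
           = \sum_{j=1}^{K_i} \alpha_{ij}\, c_i^T x_{ij}
           = \sum_{j=1}^{K_i} p_{ij}\alpha_{ij}, \\
L_i x_i   &= \sum_{j=1}^{K_i} \alpha_{ij}\, L_i x_{ij}
           = \sum_{j=1}^{K_i} q_{ij}\alpha_{ij}.
\end{aligned}
```

Summing over i gives the objective and the linking constraint of the master problem. The constraints $A_i x_i = b_i$, $x_i \geq 0$ need not be restated, since every convex combination of points of $S_i$ again lies in $S_i$.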


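This master problem has variables

$$\alpha^T = (\alpha_{11}, \ldots, \alpha_{1K_1}, \alpha_{21}, \ldots, \alpha_{2K_2}, \ldots, \alpha_{N1}, \ldots, \alpha_{NK_N})$$

and can be expressed more compactly as

$$\begin{aligned}
\text{minimize}\quad & p^T \alpha \\
\text{subject to}\quad & Q\alpha = g,\ \alpha \geq 0
\end{aligned} \tag{3.49}$$

where $g^T = [b_0^T, 1, 1, \ldots, 1]$; the element of $p$ associated with $\alpha_{ij}$ is $p_{ij}$; and the column of $Q$ associated with $\alpha_{ij}$ is

$$\begin{bmatrix} q_{ij} \\ e_i \end{bmatrix},$$

with $e_i$ denoting the ith unit vector in $E^N$.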

Suppose that at some stage of the revised simplex method for the master problem we know the basis $B$ and corresponding simplex multipliers $y^T = p_B^T B^{-1}$. The corresponding relative cost vector for the nonbasic columns $D$ of $Q$ is $r_D^T = p_D^T - y^T D$, having components

$$r_{ij} = p_{ij} - y^T \begin{bmatrix} q_{ij} \\ e_i \end{bmatrix}. \tag{3.50}$$


It is not necessary to calculate all the $r_{ij}$'s; it is only necessary to determine the minimal $r_{ij}$. If the minimal value is nonnegative, the current solution is optimal and the process terminates. If, on the other hand, the minimal element is negative, the corresponding column should enter the basis.

The search for the minimal element in (3.50) is normally made with respect to nonbasic columns only. The search can be formally extended to include basic columns as well, however, since for basic elements

$$p_{ij} - y^T \begin{bmatrix} q_{ij} \\ e_i \end{bmatrix} = 0.$$

The extra zero values do not influence the subsequent procedure, since a new column will enter only if the minimal value is less than zero.

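We therefore define $r^*$ as the minimum relative cost coefficient for all possible basis vectors. That is,

$$r^* = \min_{i \in \{1, \ldots, N\}} \left\{ r_i^* = \min_{j \in \{1, \ldots, K_i\}} \left\{ p_{ij} - y^T \begin{bmatrix} q_{ij} \\ e_i \end{bmatrix} \right\} \right\}.$$

Using the definitions of $p_{ij}$ and $q_{ij}$, this becomes

$$r_i^* = \min_{j \in \{1, \ldots, K_i\}} \left\{ (c_i^T - y_0^T L_i) x_{ij} - y_{m+i} \right\}, \tag{3.51}$$

where $y_0$ is the vector made up of the first m elements of $y$, m being the number of rows of $L_i$ [the number of linking constraints in (3.46)].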

The minimization problem in (3.51) is actually solved by the ith subproblem:

$$\begin{aligned}
\text{minimize}\quad & (c_i^T - y_0^T L_i) x_i \\
\text{subject to}\quad & A_i x_i = b_i,\ x_i \geq 0.
\end{aligned} \tag{3.52}$$

This follows from the fact that $y_{m+i}$ is independent of the extreme point index j (since $y$ is fixed during the determination of the $r_i^*$'s), and that the solution of (3.52) must be that extreme point of $S_i$, say $x_{ik}$, of minimum cost, using the adjusted cost coefficients $c_i^T - y_0^T L_i$.

Thus, an algorithm for this special version of the revised simplex method applied to the master problem is the following: Given a basis $B$

Step 1. Calculate the current basic solution $x_B$, and solve $y^T B = c_B^T$ for $y$ (for the master problem, $c_B = p_B$).

Step 2. For each $i = 1, 2, \ldots, N$, determine the optimal solution $x_i^*$ of the ith subproblem (3.52) and calculate

$$r_i^* = (c_i^T - y_0^T L_i) x_i^* - y_{m+i}. \tag{3.53}$$

If all $r_i^* \geq 0$, stop; the current solution is optimal.

Step 3. Determine which column is to enter the basis by selecting the minimal $r_i^*$.

Step 4. Update the basis of the master problem as usual.
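The pricing in Steps 2 and 3 can be sketched in Python. To stay self-contained, this sketch assumes that the extreme points of each $S_i$ are listed explicitly, which is realistic only for tiny examples; in practice each subproblem (3.52) would be solved as a linear program. The function names, the data, and the multiplier values are hypothetical and are not derived from an actual master basis.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def price_out(c, L, extreme_points, y0, y_conv):
    """Return [(r_i_star, best_extreme_point), ...], one entry per block.

    c[i]              -- cost vector of block i
    L[i]              -- linking matrix of block i, stored as a list of m rows
    extreme_points[i] -- explicit list of extreme points of S_i (assumption)
    y0                -- first m simplex multipliers (linking constraints)
    y_conv[i]         -- multiplier y_{m+i} of the i-th convexity constraint
    """
    results = []
    for i, points in enumerate(extreme_points):
        # Adjusted cost row c_i^T - y0^T L_i, as in subproblem (3.52).
        adj = [c[i][k] - sum(y0[r] * L[i][r][k] for r in range(len(y0)))
               for k in range(len(c[i]))]
        best = min(points, key=lambda x: dot(adj, x))
        # r_i^* as in (3.53).
        results.append((dot(adj, best) - y_conv[i], best))
    return results


# Two blocks, one linking constraint (m = 1); all numbers are made up.
c = [[1, 2], [3, 1]]
L = [[[1, 0]], [[0, 1]]]
extreme_points = [[(4, 0), (0, 4)],   # S_1: x1 + x2 = 4, x >= 0
                  [(2, 0), (0, 2)]]   # S_2: x1 + x2 = 2, x >= 0
y0, y_conv = [2], [3, 1]

results = price_out(c, L, extreme_points, y0, y_conv)
entering_block = min(range(len(results)), key=lambda i: results[i][0])
print(results)         # [(-7, (4, 0)), (-3, (0, 2))]
print(entering_block)  # 0
```

Since the minimal value is negative, the column built from the extreme point (4, 0) of the first block would enter the master basis in Step 3.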

This algorithm has an interesting economic interpretation in the context of a multidivisional firm minimizing its total cost of operations as described earlier. Division i's activities are internally constrained by $A_i x_i = b_i$, and the common resources $b_0$ impose linking constraints. At Step 1 of the algorithm, the firm's central management formulates its current master plan, which is perhaps suboptimal, and announces a new set of prices that each division must use to revise its recommended strategy at Step 2. In particular, $-y_0$ reflects the new prices that higher management has placed on the common resources. The division that reports the greatest rate of potential cost improvement has its recommendations incorporated in the new master plan at Step 3, and the process is repeated. If no cost improvement is possible, central management settles on the current master plan.


3.9 Summary

The simplex method is founded on the fact that the optimal value of a linear program, if finite, is always attained at a basic feasible solution. Using this foundation there are two ways in which to visualize the simplex process. The first is to view the process as one of continuous change. One starts with a basic feasible solution and imagines that some nonbasic variable is increased slowly from zero. As the value of this variable is increased, the values of the current basic variables are continuously adjusted so that the overall vector continues to satisfy the system of linear equality constraints. The change in the objective function due to a unit change in this nonbasic variable, taking into account the corresponding required changes in the values of the basic variables, is the relative cost coefficient associated with the nonbasic variable. If this coefficient is negative, then the objective value will be continuously improved as the value of this nonbasic variable is increased, and therefore one increases the variable as far as possible, to the point where further increase would violate feasibility. At this point the value of one of the basic variables is zero, and that variable is declared nonbasic, while the nonbasic variable that was increased is declared basic.
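The phrase "as far as possible" corresponds to the usual minimum ratio test. A minimal sketch, in which `x_basic` holds the current values of the basic variables and `column` holds the amount by which each basic variable decreases per unit increase of the entering variable (both names and the numbers are illustrative assumptions):

```python
def ratio_test(x_basic, column):
    """Largest step t with x_basic - t * column >= 0.

    Returns (t, leaving_index); leaving_index is None when no basic
    variable decreases, i.e. the entering variable can grow without bound.
    """
    best_t, leaving = float("inf"), None
    for i, (xi, di) in enumerate(zip(x_basic, column)):
        # Only basic variables that decrease (di > 0) limit the step.
        if di > 0 and xi / di < best_t:
            best_t, leaving = xi / di, i
    return best_t, leaving


t, leaving = ratio_test([4, 6, 3], [2, -1, 1])
print(t, leaving)  # 2.0 0
```

In this example the first basic variable reaches zero at t = 2 and is declared nonbasic, while the other two basic variables move to 8 and 1.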

The  other  viewpoint  is  more  discrete  in  nature.  Realizing  that  only  basic  feasible 
solutions  need  be  considered,  various  bases  are  selected  and  the  corresponding  basic 
solutions  are  calculated  by  solving  the  associated  set  of  linear  equations.  The  logic 
for  the  systematic  selection  of  new  bases  again  involves  the  relative  cost  coefficients 
and,  of  course,  is  derived  largely  from  the  first,  continuous,  viewpoint. 

Problems of special structure are important both for applications and for theory. The transportation problem represents an important class of linear programs with structural properties that lead to an efficient implementation of the simplex method. The most important property of the transportation problem is that any basis is triangular. This means that the basic variables can be found, one by one, directly by back substitution, and the basis need never be inverted. Likewise, the simplex multipliers can be found by back substitution, since they solve a set of equations involving the transpose of the basis. Moreover, when any basis matrix is triangular and all nonzero elements are equal to one (or minus one if the signs of some equations are changed), it follows that the process of back substitution will simply involve repeated additions and subtractions of the given row and column sums. No multiplication or division is required. It therefore follows that if the original right-hand sides are integers, the values of all basic variables will be integers. Hence, an optimal basic solution, where each entry is integral, always exists; that is, there is no gap between the continuous linear program and the integer linear program (or the integrality gap is zero). The transportation problem can be generalized to a minimum cost flow problem in a network. This leads to the interpretation of a simplex basis as corresponding to a spanning tree in the network; see Appendix D.

Many linear programming solvers implement a presolve procedure to eliminate redundant or duplicate constraints and/or fixed-value variables, and to check for possible constraint inconsistency and unboundedness. This typically results in problem size reduction and possible early detection of infeasibility.

1. Using pivoting, solve the simultaneous equations

$$\begin{aligned}
3x_1 + 2x_2 &= 5 \\
5x_1 + x_2 &= 9.
\end{aligned}$$

2. Using pivoting, solve the simultaneous equations

$$\begin{aligned}
x_1 + 2x_2 + x_3 &= 7 \\
2x_1 - x_2 + 2x_3 &= 6 \\
x_1 + x_2 + 3x_3 &= 12.
\end{aligned}$$


3.  Solve  the  equations  in  Exercise  2  by  Gaussian  elimination  as  described  in 
Appendix  C. 

4. Suppose $B$ is an $m \times m$ square nonsingular matrix, and let the tableau $T$ be constructed, $T = [I, B]$ where $I$ is the $m \times m$ identity matrix. Suppose that pivot operations are performed on this tableau so that it takes the form $[C, I]$. Show that $C = B^{-1}$.

5. Show that if the vectors $a_1, a_2, \ldots, a_m$ are a basis in $E^m$, the vectors $a_1, a_2, \ldots, a_{p-1}, a_q, a_{p+1}, \ldots, a_m$ also are a basis if and only if $a_{pq} \neq 0$, where $a_{pq}$ is defined by the tableau (3.5).

6. If $r_j > 0$ for every j corresponding to a variable $x_j$ that is not basic, show that the corresponding basic feasible solution is the unique optimal solution.

7. Show that a degenerate basic feasible solution may be optimal without satisfying $r_j \geq 0$ for all j.

8.

(a) Using the simplex procedure, solve

$$\begin{aligned}
\text{maximize}\quad & -x_1 + x_2 \\
\text{subject to}\quad & x_1 - x_2 \leq 2 \\
& x_1 + x_2 \leq 6 \\
& x_1 \geq 0,\ x_2 \geq 0.
\end{aligned}$$

(b) Draw a graphical representation of the problem in $x_1$, $x_2$ space and indicate the path of the simplex steps.

(c) Repeat for the problem

$$\begin{aligned}
\text{maximize}\quad & x_1 + x_2 \\
\text{subject to}\quad & -2x_1 + x_2 \leq 1 \\
& x_1 - x_2 \leq 1 \\
& x_1 \geq 0,\ x_2 \geq 0.
\end{aligned}$$

9. Using the simplex procedure, solve the spare-parts manufacturer's problem (Exercise 4, Chap. 2).

10. Using the simplex procedure, solve

$$\begin{aligned}
\text{minimize}\quad & 2x_1 + 4x_2 + x_3 + x_4 \\
\text{subject to}\quad & x_1 + 3x_2 + x_4 \leq 4 \\
& 2x_1 + x_2 \leq 3 \\
& x_2 + 4x_3 + x_4 \leq 3 \\
& x_i \geq 0, \quad i = 1, 2, 3, 4.
\end{aligned}$$

11. For the linear program of Exercise 10

(a)  How  much  can  the  elements  of  b  =  (4, 3, 3)  be  changed  without  changing  the 
optimal  basis? 

(b)  How  much  can  the  elements  of  c  =  (2, 4, 1, 1)  be  changed  without  changing  the 
optimal  basis? 

(c)  What  happens  to  the  optimal  cost  for  small  changes  in  b? 

(d)  What  happens  to  the  optimal  cost  for  small  changes  in  c? 

12. Consider the problem

$$\begin{aligned}
\text{minimize}\quad & x_1 - 3x_2 - 0.4x_3 \\
\text{subject to}\quad & 3x_1 - x_2 + 2x_3 \leq 7 \\
& -2x_1 + 4x_2 \leq 12 \\
& -4x_1 + 3x_2 + 3x_3 \leq 14 \\
& x_1 \geq 0,\ x_2 \geq 0,\ x_3 \geq 0.
\end{aligned}$$

(a) Find an optimal solution.

(b) How many optimal basic feasible solutions are there?

(c) Show that if $c_4 + \tfrac{1}{5}a_{14} + \tfrac{4}{5}a_{24} \geq 0$, then another activity $x_4$ can be introduced with cost coefficient $c_4$ and activity vector $(a_{14}, a_{24}, a_{34})$ without changing the optimal solution.

13. Rather than select the variable corresponding to the most negative relative cost coefficient as the variable to enter the basis, it has been suggested that a better criterion would be to select that variable which, when pivoted in, will produce the greatest improvement in the objective function. Show that this criterion leads to selecting the variable $x_k$ corresponding to the index k minimizing $\max_{i:\, a_{ik} > 0} r_k a_{i0}/a_{ik}$.

14.  In  the  ordinary  simplex  method  one  new  vector  is  brought  into  the  basis  and 
one  removed  at  every  step.  Consider  the  possibility  of  bringing  two  new  vectors 
into  the  basis  and  removing  two  at  each  stage.  Develop  a  complete  procedure 
that  operates  in  this  fashion. 

15. Degeneracy. If a basic feasible solution is degenerate, it is then theoretically possible that a sequence of degenerate basic feasible solutions will be generated that endlessly cycles without making progress. It is the purpose of this exercise and the next two to develop a technique that can be applied to the simplex method to avoid this cycling.

Corresponding to the linear system $Ax = b$ where $A = [a_1, a_2, \ldots, a_n]$ define the perturbed system $Ax = b(\varepsilon)$ where $b(\varepsilon) = b + \varepsilon a_1 + \varepsilon^2 a_2 + \cdots + \varepsilon^n a_n$, $\varepsilon > 0$. Show that if there is a basic feasible solution (possibly degenerate) to the unperturbed system with basis $B = [a_1, a_2, \ldots, a_m]$, then corresponding to the same basis, there is a nondegenerate basic feasible solution to the perturbed system for some range of $\varepsilon > 0$.

16. Show that corresponding to any basic feasible solution to the perturbed system of Exercise 15, which is nondegenerate for some range of $\varepsilon > 0$, and to a vector $a_k$ not in the basis, there is a unique vector $a_j$ in the basis which when replaced by $a_k$ leads to a basic feasible solution; and that solution is nondegenerate for a range of $\varepsilon > 0$.

17. Show that the tableau associated with a basic feasible solution of the perturbed system of Exercise 15, and which is nondegenerate for a range of $\varepsilon > 0$, is identical with that of the unperturbed system except in the column under $b(\varepsilon)$. Show how the proper pivot in a given column to preserve feasibility of the perturbed system can be determined from the tableau of the unperturbed system. Conclude that the simplex method will avoid cycling if whenever there is a choice in the pivot element of a column k, arising from a tie in the minimum of $a_{i0}/a_{ik}$ among the elements $i \in I_0$, the tie is resolved by finding the minimum of $a_{i1}/a_{ik}$, $i \in I_0$. If there still remain ties among elements $i \in I$, the process is repeated with $a_{i2}/a_{ik}$, etc., until there is a unique element.

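18. Using the two-phase simplex procedure solve

(a)

$$\begin{aligned}
\text{minimize}\quad & -3x_1 + x_2 + 3x_3 - x_4 \\
\text{subject to}\quad & x_1 + 2x_2 - x_3 + x_4 = 0 \\
& 2x_1 - 2x_2 + 3x_3 + 3x_4 = 9 \\
& x_1 - x_2 + 2x_3 - x_4 = 6 \\
& x_i \geq 0, \quad i = 1, 2, 3, 4.
\end{aligned}$$

(b)

$$\begin{aligned}
\text{minimize}\quad & x_1 + 6x_2 - 7x_3 + x_4 + 5x_5 \\
\text{subject to}\quad & 5x_1 - 4x_2 + 13x_3 - 2x_4 + x_5 = 20 \\
& x_1 - x_2 + 5x_3 - x_4 + x_5 = 8 \\
& x_i \geq 0, \quad i = 1, 2, 3, 4, 5.
\end{aligned}$$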

19.  Solve  the  oil  refinery  problem  (Exercise  3,  Chap.  2). 

20.  Show  that  in  the  phase  I  procedure  of  a  problem  that  has  feasible  solutions,  if  an 
artificial  variable  becomes  nonbasic,  it  need  never  again  be  made  basic.  Thus, 
when  an  artificial  variable  becomes  nonbasic  its  column  can  be  eliminated  from 
future  tableaus. 

21. Suppose the phase I procedure is applied to the system $Ax = b$, $x \geq 0$, and that the resulting tableau (ignoring the cost row) has the form

This corresponds to having $m - k$ basic artificial variables at zero level.

(a) Show that any nonzero element in $R_2$ can be used as a pivot to eliminate a basic artificial variable, thus yielding a similar tableau but with k increased by one.

(b) Suppose that the process in (a) has been repeated to the point where $R_2 = 0$. Show that the original system is redundant, and show how phase II may proceed by eliminating the bottom rows.

(c) Use the above method to solve the linear program

$$\begin{aligned}
\text{minimize}\quad & 2x_1 + 6x_2 + x_3 + x_4 \\
\text{subject to}\quad & x_1 + 2x_2 + x_4 = 6 \\
& x_1 + 2x_2 + x_3 + x_4 = 7 \\
& x_1 + 3x_2 - x_3 + 2x_4 = 7 \\
& x_1 + x_2 + x_3 = 5 \\
& x_1 \geq 0,\ x_2 \geq 0,\ x_3 \geq 0,\ x_4 \geq 0.
\end{aligned}$$

22. Find a basic feasible solution to

$$\begin{aligned}
x_1 + 2x_2 - x_3 + x_4 &= 3 \\
2x_1 + 4x_2 + x_3 + 2x_4 &= 12 \\
x_1 + 4x_2 + 2x_3 + x_4 &= 9 \\
x_i \geq 0, \quad i &= 1, 2, 3, 4.
\end{aligned}$$

23. Consider the system of linear inequalities $Ax \geq b$, $x \geq 0$ with $b \geq 0$. This system can be transformed to standard form by the introduction of m surplus variables so that it becomes $Ax - y = b$, $x \geq 0$, $y \geq 0$. Let $b_k = \max_i b_i$ and consider the new system in standard form obtained by adding the kth row to the negative of every other row. Show that the new system requires the addition of only a single artificial variable to obtain an initial basic feasible solution.

Use this technique to find a basic feasible solution to the system

$$\begin{aligned}
x_1 + 2x_2 + x_3 &\geq 4 \\
2x_1 + x_2 + x_3 &\geq 5 \\
2x_1 + 3x_2 + 2x_3 &\geq 6 \\
x_i \geq 0, \quad i &= 1, 2, 3.
\end{aligned}$$

24. It is possible to combine the two phases of the two-phase method into a single procedure by the big-M method. Given the linear program in standard form

$$\begin{aligned}
\text{minimize}\quad & c^T x \\
\text{subject to}\quad & Ax = b,\ x \geq 0,
\end{aligned}$$

one forms the approximating problem

$$\begin{aligned}
\text{minimize}\quad & c^T x + M \sum_{i=1}^{m} u_i \\
\text{subject to}\quad & Ax + u = b \\
& x \geq 0,\ u \geq 0.
\end{aligned}$$

In this problem $u = (u_1, u_2, \ldots, u_m)$ is a vector of artificial variables and M is a large constant. The term $M \sum_{i=1}^{m} u_i$ serves as a penalty term for nonzero $u_i$'s.

If this problem is solved by the simplex method, show the following:

(a) If an optimal solution is found with $u = 0$, then the corresponding $x$ is an optimal basic feasible solution to the original problem.

(b) If for every $M > 0$ an optimal solution is found with $u \neq 0$, then the original problem is infeasible.

(c) If for every $M > 0$ the approximating problem is unbounded, then the original problem is either unbounded or infeasible.

(d) Suppose now that the original problem has a finite optimal value $V(\infty)$. Let $V(M)$ be the optimal value of the approximating problem. Show that $V(M) \leq V(\infty)$.

(e) Show that for $M_1 \leq M_2$ we have $V(M_1) \leq V(M_2)$.

(f) Show that there is a value $M_0$ such that for $M \geq M_0$, $V(M) = V(\infty)$, and hence conclude that the big-M method will produce the right solution for large enough values of M.

25. Using the revised simplex method find a basic feasible solution to

$$\begin{aligned}
x_1 + 2x_2 - x_3 + x_4 &= 3 \\
2x_1 + 4x_2 + x_3 + 2x_4 &= 12 \\
x_1 + 4x_2 + 2x_3 + x_4 &= 9 \\
x_i \geq 0, \quad i &= 1, 2, 3, 4.
\end{aligned}$$

26. The following tableau is an intermediate stage in the solution of a minimization problem:

           y1     y2     y3     y4     y5     y6     y0
            1    2/3      0      0    4/3      0      4
            0   -7/3      3      1   -2/3      0      2
            0   -2/3     -2      0    2/3      1      2
    r^T     0    8/3    -11      0    4/3      0     -8

(a) Determine the next pivot element.

(b) Given that the inverse of the current basis is

$$B^{-1} = [a_1, a_4, a_6]^{-1} = \frac{1}{3}\begin{bmatrix} 1 & 1 & -1 \\ 1 & -2 & 2 \\ -1 & 2 & 1 \end{bmatrix}$$

and the corresponding cost coefficients are

$$c_B^T = (c_1, c_4, c_6) = (-1, -3, 1),$$

find the original problem.

27. In many applications of linear programming it may be sufficient, for practical purposes, to obtain a solution for which the value of the objective function is within a predetermined tolerance $\varepsilon$ from the minimum value $z^*$. Stopping the simplex algorithm at such a solution rather than searching for the true minimum may considerably reduce the computations.

(a) Consider a linear programming problem for which the sum of the variables is known to be bounded above by s. Let $z_0$ denote the current value of the objective function at some stage of the simplex algorithm, $(c_j - z_j)$ the corresponding relative cost coefficients, and

$$M = \max_j\, (z_j - c_j).$$

Show that if $M \leq \varepsilon/s$, then $z_0 - z^* \leq \varepsilon$.

(b) Consider the transportation problem described in Sect. 2.2 (Example 3). Assuming this problem is solved by the simplex method and it is sufficient to obtain a solution within $\varepsilon$ tolerance from the optimal value of the objective function, specify a stopping criterion for the algorithm in terms of $\varepsilon$ and the parameters of the problem.

28. A matrix A is said to be totally unimodular if the determinant of every square submatrix formed from it has value 0, +1, or -1.

(a)  Show  that  the  matrix  A  defining  the  equality  constraints  of  a  transportation 
problem  is  totally  unimodular. 

(b)  In  the  system  of  equations  Ax  =  b,  assume  that  A  is  totally  unimodular  and 
that  all  elements  of  A  and  b  are  integers.  Show  that  all  basic  solutions  have 
integer  components. 

29.  For  the  arrays  below: 

(a)  Compute  the  basic  solutions  indicated.  (Note:  They  may  be  infeasible.) 

(b)  Write  the  equations  for  the  basic  variables,  corresponding  to  the  indicated 
basic  solutions,  in  lower  triangular  form. 


30. For the arrays of cost coefficients below, the circled positions indicate basic variables.

(a) Compute the simplex multipliers.

(b) Write the equations for the simplex multipliers in upper triangular form, and compare with Part (b) of Exercise 29.


3  ©  ® 

©  6  ® 

2  ©  3 

2  ©  3 

©  5  © 

1  ©  © 

3 1 .  Consider  the  modified  transportation  problem  where  there  is  more  available  at 

m  n 

origins  than  is  required  at  destinations  (i.e.,  Yuai>  Z  bj). 

i=  1  7=1 


minimize 


m  n 

IZ 


7=1  i=  1 


CijXij 


subject  to 


n 

2_jXij  <  at, 

j=  i 


i  =  1, 2, . . . ,  m 

j  1 , 2, . . . ,  Tl 


Xij  >  0,  for  all  i,  j. 


(a)  Show  how  to  convert  it  to  an  ordinary  transportation  problem. 

(b) Suppose there is a storage cost of $s_i$ per unit at origin i for goods not
transported to a destination. Repeat Part (a) with this assumption.

32.  Solve  the  following  transportation  problem,  which  is  an  original  example  of 
Hitchcock. 


$a = (25, 25, 50)$,  $b = (15, 20, 30, 35)$,

$$C = \begin{bmatrix} 10 & 5 & 6 & 7 \\ 8 & 2 & 7 & 6 \\ 9 & 3 & 4 & 8 \end{bmatrix}$$

33.  In  a  transportation  problem,  suppose  that  two  rows  or  two  columns  of  the  cost 
coefficient  array  differ  by  a  constant.  Show  that  the  problem  can  be  reduced  by 
combining  those  rows  or  columns. 

34. The transportation problem is often solved more quickly by carefully selecting
the starting basic feasible solution. The matrix minimum technique for finding
a starting solution is: (1) Find the lowest cost unallocated cell in the array,
and allocate the maximum possible to it. (2) Reduce the corresponding row
and column requirements, and drop the row or column having zero remaining
requirement. Go back to Step 1 unless all remaining requirements are zero.

(a)  Show  that  this  procedure  yields  a  basic  feasible  solution. 

(b)  Apply  the  method  to  Exercise  7. 
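
The two steps of the matrix minimum technique can be sketched in Python. This is only an illustrative sketch, not part of the original exercise. The function name `matrix_minimum`, the tie-breaking rule (lowest row index, then lowest column index), and the convention of dropping exactly one row or column per step are assumptions, and the problem is assumed to be balanced (total supply equals total demand). The data are those of Exercise 32.

```python
def matrix_minimum(cost, supply, demand):
    """Starting allocation {(i, j): amount} by the matrix minimum technique.

    Assumes a balanced problem: sum(supply) == sum(demand).
    """
    supply, demand = list(supply), list(demand)
    rows, cols = set(range(len(supply))), set(range(len(demand)))
    alloc = {}
    while rows and cols:
        # Step 1: lowest-cost cell among the rows and columns not yet dropped
        # (ties broken by lowest row index, then lowest column index).
        i, j = min(((i, j) for i in rows for j in cols),
                   key=lambda ij: (cost[ij[0]][ij[1]], ij))
        amount = min(supply[i], demand[j])
        alloc[(i, j)] = amount
        # Step 2: reduce both requirements and drop exactly one row or column
        # whose remaining requirement is zero.
        supply[i] -= amount
        demand[j] -= amount
        if supply[i] == 0 and len(rows) > 1:
            rows.discard(i)
        else:
            cols.discard(j)
    return alloc


# Data of Exercise 32 (origins a, destinations b, cost array C).
a = [25, 25, 50]
b = [15, 20, 30, 35]
C = [[10, 5, 6, 7],
     [8, 2, 7, 6],
     [9, 3, 4, 8]]

start = matrix_minimum(C, a, b)
total_cost = sum(C[i][j] * amt for (i, j), amt in start.items())
```

Dropping only one row or column per step, even when both requirements reach zero together, keeps the number of allocated cells at m + n - 1, with some of them possibly at level zero. The result is a starting solution only; it is not claimed to be optimal.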
35. The caterer problem. A caterer is booked to cater a banquet each evening for the
next T days. He requires $r_t$ clean napkins on the tth day for t = 1, 2, ..., T. He
may send dirty napkins to the laundry, which has two speeds of service: fast
and slow. The napkins sent to the fast service will be ready for the next day’s
banquet; those sent to the slow service will be ready for the banquet 2 days later.
Fast and slow service cost $c_1$ and $c_2$ per napkin, respectively, with $c_1 > c_2$. The
caterer may also purchase new napkins at any time at cost $c_0$. He has an initial
stock of s napkins and wishes to minimize the total cost of supplying fresh
napkins.

(a)  Formulate  the  problem  as  a  transportation  problem.  (Hint:  Use  T  + 1  sources 
and  T  destinations.) 

(b) Using the values T = 4, s = 200, $r_1 = 100$, $r_2 = 130$, $r_3 = 150$, $r_4 = 140$,
$c_1 = 6$, $c_2 = 4$, $c_0 = 12$, solve the problem.

36. The marriage assignment problem. A group of n men and n women live on an
island. The amount of happiness that the ith man and the jth woman derive by
spending a fraction $x_{ij}$ of their lives together is $c_{ij} x_{ij}$. What is the nature of the
living arrangements that maximizes the total happiness of the islanders?

37.  Anticycling  Rule.  A  remarkably  simple  procedure  for  avoiding  cycling  was 
developed  by  Bland,  and  we  discuss  it  here. 

Bland’s  Rule.  In  the  simplex  method: 

(a) Select the column to enter the basis by $j = \min\{j : r_j < 0\}$; that is, select the
lowest indexed favorable column.

(b) In case ties occur in the criterion for determining which column is to leave the
basis, select the one with lowest index.

We  can  prove  by  contradiction  that  the  use  of  Bland’s  rule  prohibits  cycling. 
Suppose  that  cycling  occurs.  During  the  cycle  a  finite  number  of  columns  enter 
and  leave  the  basis.  Each  of  these  columns  enters  at  level  zero,  and  the  cost 
function  does  not  change. 

Delete  all  rows  and  columns  that  do  not  contain  pivots  during  a  cycle,  obtaining 
a  new  linear  program  that  also  cycles.  Assume  that  this  reduced  linear  program 
has  m  rows  and  n  columns.  Consider  the  solution  stage  where  column  n  is  about 
to  leave  the  basis,  being  replaced  by  column  p.  The  corresponding  tableau  is  as 
follows  (where  the  entries  shown  are  explained  below): 

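$$
\begin{array}{c|ccccc|c}
 & a_1 & \cdots & a_p & \cdots & a_n & b \\ \hline
 & & & \le 0 & & 0 & 0 \\
 & & & \vdots & & \vdots & \vdots \\
 & & & \le 0 & & 0 & 0 \\
 & & & > 0 & & 1 & 0 \\ \hline
c^T & & & < 0 & & 0 & 0
\end{array}
$$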

Without loss of generality, we assume that the current basis consists of the last
m columns. In fact, we may define the reduced linear program in terms of this
tableau, calling the current coefficient array A and the current relative cost
vector c. In this tableau we pivot on $a_{mp}$, so $a_{mp} > 0$. By Part (b) of Bland’s rule,
$a_n$ can leave the basis only if there are no ties in the ratio test, and since b = 0
because all rows are in the cycle, it follows that $a_{ip} \le 0$ for all $i \ne m$.

Now consider the situation when column n is about to reenter the basis. Part (a)
of Bland’s rule ensures that $r_n < 0$ and $r_i \ge 0$ for all $i \ne n$. Apply the formula
$r_i = c_i - y^T a_i$ to the last m columns to show that each component of y except $y_m$
is nonpositive; and $y_m > 0$. Then use this to show that $r_p = c_p - y^T a_p < c_p < 0$,
contradicting $r_p \ge 0$.
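
The two parts of Bland’s rule can be written as a pair of small selection functions. The sketch below is an illustration under stated assumptions, not a full simplex code: the function names, the argument layout (a list of relative costs, the entering column, the current right-hand side, and the list `basis` giving the index of the basic variable in each row) are all hypothetical choices made here.

```python
from fractions import Fraction


def bland_entering(r):
    """Part (a): lowest-indexed column with negative relative cost, or None."""
    for j, rj in enumerate(r):
        if rj < 0:
            return j
    return None  # no favorable column: the current basis is optimal


def bland_leaving(column, rhs, basis):
    """Part (b): ratio test, ties broken by lowest index of the basic variable.

    column[i] is the entry of the entering column in row i, rhs[i] is the
    current value of the basic variable in row i, and basis[i] is its index.
    Returns the pivot row, or None if no entry is positive (unbounded).
    """
    best = None
    for i, (a, b) in enumerate(zip(column, rhs)):
        if a > 0:
            key = (Fraction(b) / Fraction(a), basis[i])
            if best is None or key < best[0]:
                best = (key, i)
    return None if best is None else best[1]
```

For instance, with relative costs `[0, -3, 2, -1]` the entering column is index 1, and in a degenerate ratio test where rows 0 and 1 both give ratio zero, the row whose basic variable has the lower index is chosen.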

38. Use the Dantzig-Wolfe decomposition method to solve

minimize   $-4x_1 - x_2 - 3x_3 - 2x_4$
subject to $2x_1 + 2x_2 + x_3 + 2x_4 \le 6$
           $x_2 + 2x_3 + 3x_4 \ge 4$
           $2x_1 + x_2 \le 5$
           $x_2 \le 1$
           $-x_3 + 2x_4 \le 2$
           $x_3 + 2x_4 \le 6$
           $x_1 \ge 0,\ x_2 \ge 0,\ x_3 \ge 0,\ x_4 \ge 0$.
state of the art in Simplex solvers see Bixby [B18].


Chapter  4 

Duality  and  Complementarity 


Associated with every linear program, and intimately related to it, is a corresponding
dual linear program. Both programs are constructed from the same underlying cost
and constraint coefficients but in such a way that if one of these problems is one of
minimization the other is one of maximization, and the optimal values of the
corresponding objective functions, if finite, are equal. The variables of the dual problem
can be interpreted as prices associated with the constraints of the original (primal)
problem, and through this association it is possible to give an economically
meaningful characterization to the dual whenever there is such a characterization for the
primal.

The variables of the dual problem are also intimately related to the calculation of
the relative cost coefficients in the simplex method. Thus, a study of duality
sharpens our understanding of the simplex procedure and motivates certain alternative
solution methods. Indeed, the simultaneous consideration of a problem from both
the primal and dual viewpoints often provides significant computational advantage
as well as economic insight.


4.1  Dual  Linear  Programs 

In this section we define the dual program that is associated with a given linear
program. Initially, we depart from our usual strategy of considering programs in
standard form, since the duality relationship is most symmetric for programs expressed
solely in terms of inequalities. Specifically then, we define duality through the pair
of programs displayed below.
Primal                              Dual

minimize   $c^T x$                  maximize   $y^T b$
subject to $Ax \ge b$               subject to $y^T A \le c^T$          (4.1)
           $x \ge 0$                           $y \ge 0$

If A is an m × n matrix, then x is an n-dimensional column vector, b is an
m-dimensional column vector, $c^T$ is an n-dimensional row vector, and $y^T$ is an
m-dimensional row vector. The vector x is the variable of the primal program, and y
is the variable of the dual program.

The pair of programs (4.1) is called the symmetric form of duality and, as
explained below, can be used to define the dual of any linear program. It is important
to note that the role of primal and dual can be reversed. Thus, studying in detail
the process by which the dual is obtained from the primal (interchange of cost and
constraint vectors, transposition of the coefficient matrix, reversal of the constraint
inequalities, and change of minimization to maximization), we see that this same process
applied to the dual yields the primal. Put another way, if the dual is transformed,
by multiplying the objective and the constraints by minus unity, so that it has the
structure of the primal (but is still expressed in terms of y), its corresponding dual
will be equivalent to the original primal.

The dual of any linear program can be found by converting the program to the
form of the primal shown above. For example, given a linear program in standard
form

minimize   $c^T x$
subject to $Ax = b$, $x \ge 0$,

we write it in the equivalent form

minimize   $c^T x$
subject to $Ax \ge b$
           $-Ax \ge -b$
           $x \ge 0$,

which is in the form of the primal of (4.1) but with coefficient matrix
$\begin{bmatrix} A \\ -A \end{bmatrix}$. Using a dual vector partitioned as (u, v), the corresponding dual is

maximize   $u^T b - v^T b$
subject to $u^T A - v^T A \le c^T$
           $u \ge 0$, $v \ge 0$.

Letting y = u - v we may simplify the representation of the dual program so that
we obtain the pair of problems displayed below:

Primal                              Dual

minimize   $c^T x$                  maximize   $y^T b$                  (4.2)
subject to $Ax = b$, $x \ge 0$      subject to $y^T A \le c^T$.

This is the asymmetric form of the duality relation. In this form the dual vector y
(which is really a composite of u and v) is not restricted to be nonnegative.

Similar  transformations  can  be  worked  out  for  any  linear  program  to  first  get  the 
primal  in  the  form  (4.1),  calculate  the  dual,  and  then  simplify  the  dual  to  account 
for  special  structure. 

In general, if some of the linear inequalities in the primal (4.1) are changed to
equality, the corresponding components of y in the dual become free variables.
If some of the components of x in the primal are free variables, then the
corresponding inequalities in $y^T A \le c^T$ are changed to equality in the dual. We mention
again that these are not arbitrary rules but are direct consequences of the original
definition and the equivalence of various forms of linear programs.
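
The reversibility of the roles of primal and dual can be checked mechanically. The following Python sketch is an illustration only: the function name `symmetric_dual` and the small data set are hypothetical. It takes the data (c, A, b) of a program of the form minimize $c^T x$ subject to $Ax \ge b$, $x \ge 0$, and returns the data of its dual after multiplying the objective and constraints by minus unity so that the dual is again a minimization of the same form.

```python
def symmetric_dual(c, A, b):
    """Dual of  min c^T x, Ax >= b, x >= 0,  rewritten in the same form.

    The dual  max y^T b, y^T A <= c^T, y >= 0  is equivalent to
    min (-b)^T y, (-A^T) y >= -c, y >= 0,  so the returned triple is
    (cost vector, coefficient matrix, right-hand side) = (-b, -A^T, -c).
    """
    m, n = len(A), len(A[0])
    A_dual = [[-A[i][j] for i in range(m)] for j in range(n)]  # -A^T
    return [-bi for bi in b], A_dual, [-cj for cj in c]


# Hypothetical data with m = 3 constraints and n = 2 variables.
c = [2, 3]
A = [[1, 1],
     [1, 2],
     [3, 1]]
b = [4, 5, 6]

c_d, A_d, b_d = symmetric_dual(c, A, b)
c_dd, A_dd, b_dd = symmetric_dual(c_d, A_d, b_d)  # dual of the dual
```

Applying the transformation twice returns the original data, which is the statement that the dual of the dual is the primal.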

Example 1 (Dual of the Diet Problem). The diet problem, Example 1, Sect. 2.2, was
the problem faced by a dietitian trying to select a combination of foods to meet
certain nutritional requirements at minimum cost. This problem has the form

minimize   $c^T x$
subject to $Ax \ge b$, $x \ge 0$

and hence can be regarded as the primal program of the symmetric pair above. We
describe an interpretation of the dual problem.

Imagine a pharmaceutical company that produces in pill form each of the
nutrients considered important by the dietitian. The pharmaceutical company tries
to convince the dietitian to buy pills, and thereby supply the nutrients directly rather
than through purchase of various foods. The problem faced by the drug company
is that of determining positive unit prices $\lambda_1, \lambda_2, \ldots, \lambda_m$ (the components of y)
for the nutrients so as to maximize revenue while at the same time being competitive
with real food. To be competitive with real food, the cost of a unit of food i made
synthetically from pure nutrients bought from the druggist must be no greater than
$c_i$, the market price of the food. Thus, denoting by $a_i$ the ith food, the company must
satisfy $y^T a_i \le c_i$ for each i. In matrix form this is equivalent to $y^T A \le c^T$. Since
$b_j$ units of the jth nutrient will be purchased, the problem of the druggist is

maximize   $y^T b$
subject to $y^T A \le c^T$, $y \ge 0$,

which is the dual problem.

Example 2 (Dual of the Transportation Problem). The transportation problem,
Example 3, Sect. 2.2, is the problem, faced by a manufacturer, of selecting the
pattern of product shipments between several fixed origins and destinations so as to
minimize transportation cost while satisfying demand. Referring to (4.6) and (4.7)
of Chap. 2, the problem is in standard form, and hence the asymmetric version of
the duality relation applies. There is a dual variable for each constraint. In this case
we denote the variables $u_i$, i = 1, 2, ..., m for (4.6) and $v_j$, j = 1, 2, ..., n for (4.7).
Accordingly, the dual is

maximize   $\sum_{i=1}^{m} a_i u_i + \sum_{j=1}^{n} b_j v_j$
subject to $u_i + v_j \le c_{ij}$,  i = 1, 2, ..., m,
                                    j = 1, 2, ..., n.

To  interpret  the  dual  problem,  we  imagine  an  entrepreneur  who,  feeling  that  he  can 
ship  more  efficiently,  comes  to  the  manufacturer  with  the  offer  to  buy  his  product  at 
the  plant  sites  (origins)  and  sell  it  at  the  warehouses  (destinations).  The  product  price 
that  is  to  be  used  in  these  transactions  varies  from  point  to  point,  and  is  determined 
by  the  entrepreneur  in  advance.  He  must  choose  these  prices,  of  course,  so  that  his 
offer  will  be  attractive  to  the  manufacturer. 

The entrepreneur, then, must select prices $-u_1, -u_2, \ldots, -u_m$ for the m origins
and $v_1, v_2, \ldots, v_n$ for the n destinations. To be competitive with usual transportation
modes, his prices must satisfy $u_i + v_j \le c_{ij}$ for all i, j, since $u_i + v_j$ represents the
net amount the manufacturer must pay to sell a unit of product at origin i and buy
it back again at destination j. Subject to this constraint, the entrepreneur will adjust
his prices to maximize his revenue. Thus, his problem is as given above.


4.2  The  Duality  Theorem 

To  this  point  the  relation  between  the  primal  and  dual  programs  has  been  simply  a 
formal  one  based  on  what  might  appear  as  an  arbitrary  definition.  In  this  section, 
however,  the  deeper  connection  between  a  program  and  its  dual,  as  expressed  by  the 
Duality  Theorem,  is  derived. 

The  proof  of  the  Duality  Theorem  given  in  this  section  relies  on  the  Separating 
Hyperplane  Theorem  (Appendix  B)  and  is  therefore  somewhat  more  advanced  than 
previous  arguments.  It  is  given  here  so  that  the  most  general  form  of  the  Duality 
Theorem  is  established  directly.  An  alternative  approach  is  to  use  the  theory  of  the 
simplex  method  to  derive  the  duality  result.  A  simplified  version  of  this  alternative 
approach  is  given  in  the  next  section. 

Throughout this section we consider the primal program in standard form

minimize   $c^T x$
subject to $Ax = b$, $x \ge 0$                    (4.3)

and its corresponding dual

maximize   $y^T b$
subject to $y^T A \le c^T$.                       (4.4)

In  this  section  it  is  not  assumed  that  A  is  necessarily  of  full  rank.  The  following 
lemma  is  easily  established  and  gives  us  an  important  relation  between  the  two 
problems. 
[Figure: a z axis with the dual values on the left and the primal values on the right]

Fig. 4.1 Relation of primal and dual values


Lemma 1 (Weak Duality Lemma). If x and y are feasible for (4.3) and (4.4), respectively,
then $c^T x \ge y^T b$.

Proof. We have

$y^T b = y^T A x \le c^T x$,

the last inequality being valid since $x \ge 0$ and $y^T A \le c^T$. ∎
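
A small numerical check makes the lemma concrete. The sketch below is illustrative only. It uses the data of the standard-form program that appears in the geometric interpretation of Sect. 4.3; the particular feasible points in `xs` and `ys` and all function names are choices made here, and exact rational arithmetic from the standard library is used so that the equality constraints can be tested exactly.

```python
from fractions import Fraction as F

A = [[3, 1, -2, 1],
     [1, 3, 0, -1]]
b = [2, 2]
c = [18, 12, 2, 6]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def primal_feasible(x):
    """Ax = b and x >= 0."""
    return all(xj >= 0 for xj in x) and all(dot(row, x) == bi for row, bi in zip(A, b))


def dual_feasible(y):
    """y^T A <= c^T, checked column by column."""
    return all(dot(y, col) <= cj for col, cj in zip(zip(*A), c))


xs = [[F(1, 2), F(1, 2), 0, 0], [1, F(1, 2), 1, F(1, 2)]]   # primal feasible points
ys = [[0, 0], [1, 1], [F(21, 4), F(9, 4)]]                  # dual feasible points

for x in xs:
    for y in ys:
        assert primal_feasible(x) and dual_feasible(y)
        assert dot(c, x) >= dot(y, b)  # weak duality

primal_values = [dot(c, x) for x in xs]
dual_values = [dot(y, b) for y in ys]
```

Every dual value is a lower bound on every primal value. Here the first primal point and the last dual point give the same value, so by the lemma neither can be improved.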

This  lemma  shows  that  a  feasible  vector  to  either  problem  yields  a  bound  on  the 
value  of  the  other  problem.  The  values  associated  with  the  primal  are  all  larger  than 
the  values  associated  with  the  dual  as  illustrated  in  Fig.  4.1.  Since  the  primal  seeks 
a  minimum  and  the  dual  seeks  a  maximum,  each  seeks  to  reach  the  other.  From  this 
we  have  an  important  corollary. 

Corollary. If $x_0$ and $y_0$ are feasible for (4.3) and (4.4), respectively, and if $c^T x_0 = y_0^T b$,
then $x_0$ and $y_0$ are optimal for their respective problems.

The  above  corollary  shows  that  if  a  pair  of  feasible  vectors  can  be  found  to  the 
primal  and  dual  programs  with  equal  objective  values,  then  these  are  both  optimal. 
The  Duality  Theorem  of  linear  programming  states  that  the  converse  is  also  true, 
and  that,  in  fact,  the  two  regions  in  Fig.  4.1  actually  have  a  common  point;  there  is 
no  “gap.” 

Duality  Theorem  of  Linear  Programming.  If  either  of  the  problems  (4.3)  or  (4.4)  has  a 
finite  optimal  solution,  so  does  the  other,  and  the  corresponding  values  of  the  objective 
functions  are  equal.  If  either  problem  has  an  unbounded  objective,  the  other  problem  has 
no  feasible  solution. 

Proof. We note first that the second statement is an immediate consequence of
Lemma 1. For if the primal is unbounded and y is feasible for the dual, we must
have $y^T b \le -M$ for arbitrarily large M, which is clearly impossible.

Second  we  note  that  although  the  primal  and  dual  are  not  stated  in  symmetric 
form  it  is  sufficient,  in  proving  the  first  statement,  to  assume  that  the  primal  has 
a  finite  optimal  solution  and  then  show  that  the  dual  has  a  solution  with  the  same 
value.  This  follows  because  either  problem  can  be  converted  to  standard  form  and 
because  the  roles  of  primal  and  dual  are  reversible. 

Suppose (4.3) has a finite optimal solution with value $z_0$. In the space $E^{m+1}$ define
the convex set

$C = \{(r, w) : r = t z_0 - c^T x,\ w = t b - A x,\ x \ge 0,\ t \ge 0\}$.

It is easily verified that C is in fact a closed convex cone. We show that the point (1,
0) is not in C. If $w = t_0 b - A x_0 = 0$ with $t_0 > 0$, $x_0 \ge 0$, then $x = x_0 / t_0$ is feasible
for (4.3) and hence $r / t_0 = z_0 - c^T x \le 0$, which means $r \le 0$. If $w = -A x_0 = 0$
with $x_0 \ge 0$ and $c^T x_0 = -1$, and if x is any feasible solution to (4.3), then $x + \alpha x_0$ is
feasible for any $\alpha \ge 0$ and gives arbitrarily small objective values as $\alpha$ is increased.
This contradicts our assumption on the existence of a finite optimum and thus we
conclude that no such $x_0$ exists. Hence $(1, 0) \notin C$.

Now since C is a closed convex set, there is by Theorem 4.4, Sect. B.3, a
hyperplane separating (1, 0) and C. Thus there is a nonzero vector $[s, y] \in E^{m+1}$ and a
constant c such that

$s < c = \inf\{s r + y^T w : (r, w) \in C\}$.

Now since C is a cone, it follows that $c \ge 0$. For if there were $(r, w) \in C$ such that
$s r + y^T w < 0$, then $\alpha (r, w)$ for large $\alpha$ would violate the hyperplane inequality. On
the other hand, since $(0, 0) \in C$ we must have $c \le 0$. Thus c = 0. As a consequence
s < 0, and without loss of generality we may assume s = -1.

We have to this point established the existence of $y \in E^m$ such that

$-r + y^T w \ge 0$

for all $(r, w) \in C$. Equivalently, using the definition of C,

$(c^T - y^T A) x - t z_0 + t y^T b \ge 0$

for all $x \ge 0$, $t \ge 0$. Setting t = 0 yields $y^T A \le c^T$, which says y is feasible for the
dual. Setting x = 0 and t = 1 yields $y^T b \ge z_0$, which in view of Lemma 1 and its
corollary shows that y is optimal for the dual. ∎


4.3  Relations  to  the  Simplex  Procedure 

In this section the Duality Theorem is proved by making explicit use of the
characteristics of the simplex procedure. As a result of this proof it becomes clear that
once the primal is solved by the simplex procedure a solution to the dual is readily
obtainable.

Suppose that for the linear program

minimize   $c^T x$
subject to $Ax = b$, $x \ge 0$,                   (4.5)

we have the optimal basic feasible solution $x = (x_B, 0)$ with corresponding basis B.
We shall determine a solution of the dual program

maximize   $y^T b$
subject to $y^T A \le c^T$                        (4.6)

in terms of B.
We partition A as A = [B, D]. Since the basic feasible solution $x_B = B^{-1} b$ is
optimal, the relative cost vector r must be nonnegative in each component. From
Sect. 3.6 we have

$r_D^T = c_D^T - c_B^T B^{-1} D$,

and since $r_D$ is nonnegative in each component we have $c_B^T B^{-1} D \le c_D^T$.

Now define $y^T = c_B^T B^{-1}$. We show that this choice of y solves the dual problem.
We have

$y^T A = [y^T B,\ y^T D] = [c_B^T,\ c_B^T B^{-1} D] \le [c_B^T,\ c_D^T] = c^T$.

Thus since $y^T A \le c^T$, y is feasible for the dual. On the other hand,

$y^T b = c_B^T B^{-1} b = c_B^T x_B$,

and thus the value of the dual objective function for this y is equal to the value of
the primal problem. This, in view of Lemma 1, Sect. 4.2, establishes the optimality
of y for the dual. The above discussion yields an alternative derivation of the main
portion of the Duality Theorem.

Theorem. Let the linear program (4.5) have an optimal basic feasible solution
corresponding to the basis B. Then the vector y satisfying $y^T = c_B^T B^{-1}$ is an optimal
solution to the dual program (4.6). The optimal values of both problems are equal.

We turn now to a discussion of how the solution of the dual can be obtained
directly from the final simplex tableau of the primal. Suppose that embedded in the
original matrix A is an m × m identity matrix. This will be the case if, for example,
m slack variables are employed to convert inequalities to equalities. Then in the
final tableau the matrix $B^{-1}$ appears where the identity appeared in the beginning.
Furthermore, in the last row the components corresponding to this identity matrix
will be $c_I^T - c_B^T B^{-1}$, where $c_I$ is the m-vector representing the cost coefficients of
the variables corresponding to the columns of the original identity matrix. Thus by
subtracting these cost coefficients from the corresponding elements in the last row,
the negative of the solution $y^T = c_B^T B^{-1}$ to the dual is obtained. In particular, if, as
is the case with slack variables, $c_I = 0$, then the elements in the last row under $B^{-1}$
are equal to the negative of components of the solution to the dual.

Example. Consider the primal program

minimize   $-x_1 - 4x_2 - 3x_3$
subject to $2x_1 + 2x_2 + x_3 \le 4$
           $x_1 + 2x_2 + 2x_3 \le 6$
           $x_1 \ge 0,\ x_2 \ge 0,\ x_3 \ge 0$.

This can be solved by introducing slack variables and using the simplex
procedure. The appropriate sequence of tableaus is given below without explanation.

The optimal solution is $x_1 = 0$, $x_2 = 1$, $x_3 = 2$. The corresponding dual program is

maximize   $4\lambda_1 + 6\lambda_2$
subject to $2\lambda_1 + \lambda_2 \le -1$
           $2\lambda_1 + 2\lambda_2 \le -4$
           $\lambda_1 + 2\lambda_2 \le -3$
           $\lambda_1 \le 0,\ \lambda_2 \le 0$.

The optimal solution to the dual is obtained directly from the last row of the
simplex tableau under the columns where the identity appeared in the first tableau:
$\lambda_1 = -1$, $\lambda_2 = -1$.
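
The stated solutions can be reproduced from the formula $y^T = c_B^T B^{-1}$ without running the simplex method. The Python sketch below is an illustration under assumptions: it takes the optimal basis to consist of the columns of $x_2$ and $x_3$ (as implied by the stated optimal solution), names the slack variables $x_4$ and $x_5$, and uses a hand-written 2 × 2 inverse; none of these names come from the text.

```python
from fractions import Fraction as F

# Standard form of the example after adding slack variables x4, x5.
A = [[2, 2, 1, 1, 0],
     [1, 2, 2, 0, 1]]
b = [4, 6]
c = [-1, -4, -3, 0, 0]
basis = [1, 2]  # 0-based columns of x2 and x3, assumed optimal


def inverse_2x2(M):
    (p, q), (r, s) = M
    det = F(p * s - q * r)
    return [[s / det, -q / det], [-r / det, p / det]]


B = [[A[i][j] for j in basis] for i in range(2)]
B_inv = inverse_2x2(B)
c_B = [c[j] for j in basis]

x_B = [sum(B_inv[i][k] * b[k] for k in range(2)) for i in range(2)]    # B^{-1} b
y = [sum(c_B[i] * B_inv[i][k] for i in range(2)) for k in range(2)]    # c_B^T B^{-1}
r = [c[j] - sum(y[i] * A[i][j] for i in range(2)) for j in range(5)]   # relative costs

primal_value = sum(cb * xb for cb, xb in zip(c_B, x_B))
dual_value = sum(yi * bi for yi, bi in zip(y, b))
```

All relative costs are nonnegative, so the basis is optimal, and the entries of `r` under the two slack columns are the negatives of the dual solution, as described above. Both objective values equal -10.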


Geometric  Interpretation 

The duality relations can be viewed in terms of the dual interpretations of linear
constraints emphasized in Chap. 3. Consider a linear program in standard form. For
the sake of concreteness we consider the problem

minimize   $18x_1 + 12x_2 + 2x_3 + 6x_4$
subject to $3x_1 + x_2 - 2x_3 + x_4 = 2$
           $x_1 + 3x_2 - x_4 = 2$
           $x_1 \ge 0,\ x_2 \ge 0,\ x_3 \ge 0,\ x_4 \ge 0$.

The columns of the constraints are represented in requirements space in Fig. 4.2.
A basic solution represents construction of b with positive weights on two of the $a_i$’s.
The dual problem is

maximize   $2\lambda_1 + 2\lambda_2$
subject to $3\lambda_1 + \lambda_2 \le 18$
           $\lambda_1 + 3\lambda_2 \le 12$
           $-2\lambda_1 \le 2$
           $\lambda_1 - \lambda_2 \le 6$.

Fig. 4.2 The primal requirements space

The dual problem is shown geometrically in Fig. 4.3. Each column $a_i$ of the
primal defines a constraint of the dual as a half-space whose boundary is orthogonal
to that column vector and is located at a point determined by $c_i$. The dual objective
is maximized at an extreme point of the dual feasible region. At this point exactly
two dual constraints are active. These active constraints correspond to an optimal
basis of the primal. In fact, the vector defining the dual objective is a positive linear
combination of the column vectors defining the active constraints. In the specific
example, b is a positive combination of $a_1$ and $a_2$. The weights in this combination
are the $x_i$’s in the solution of the primal.

Fig. 4.3 The dual in activity space
Simplex Multipliers

We conclude this section by giving an economic interpretation of the relation
between the simplex basis and the vector y. At any point in the simplex procedure
we may form the vector y satisfying $y^T = c_B^T B^{-1}$. This vector is not a solution to the
dual unless B is an optimal basis for the primal, but nevertheless, it has an economic
interpretation. Furthermore, as we have seen in the development of the revised
simplex method, this y vector can be used at every step to calculate the relative cost
coefficients. For this reason $y^T = c_B^T B^{-1}$, corresponding to any basis, is often called
the vector of simplex multipliers.

Let us pursue the economic interpretation of these simplex multipliers. As usual,
denote the columns of A by $a_1, a_2, \ldots, a_n$ and denote by $e_1, e_2, \ldots, e_m$ the m unit
vectors in $E^m$. The components of the $a_i$’s and b tell how to construct these vectors
from the $e_i$’s.

Given any basis B, however, consisting of m columns of A, any other vector
can be constructed (synthetically) as a linear combination of these basis vectors.
If there is a unit cost $c_i$ associated with each basis vector $a_i$, then the cost of a
(synthetic) vector constructed from the basis can be calculated as the corresponding
linear combination of the $c_i$’s associated with the basis. In particular, the cost of the
jth unit vector, $e_j$, when constructed from the basis B, is $\lambda_j$, the jth component of
$y^T = c_B^T B^{-1}$. Thus the $\lambda_j$’s can be interpreted as synthetic prices of the unit vectors.

Now, any vector can be expressed in terms of the basis B in two steps: (1) express
the unit vectors in terms of the basis, and then (2) express the desired vector as a
linear combination of unit vectors. The corresponding synthetic cost of a vector
constructed from the basis B can correspondingly be computed directly by: (1) finding
the synthetic price of the unit vectors, and then (2) using these prices to evaluate the
cost of the linear combination of unit vectors. Thus, the simplex multipliers can be
used to quickly evaluate the synthetic cost of any vector that is expressed in terms of
the unit vectors. The difference between the true cost of this vector and the synthetic
cost is the relative cost. The process of calculating the synthetic cost of a vector,
with respect to a given basis, by using the simplex multipliers is sometimes referred
to as pricing out the vector.

Optimality of the primal corresponds to the situation where every vector $a_1, a_2,
\ldots, a_n$ is cheaper when constructed from the basis than when purchased directly at
its own price. Thus we have $y^T a_i \le c_i$ for i = 1, 2, ..., n or equivalently $y^T A \le c^T$.
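
Pricing out can be illustrated on the four-column program of the geometric interpretation above. In this sketch, which is an illustration only, the function names `simplex_multipliers` and `price_out` are hypothetical, bases are given as pairs of 0-based column indices, and the second basis $\{a_2, a_4\}$ is simply one feasible but non-optimal basis chosen for contrast.

```python
from fractions import Fraction as F

A = [[3, 1, -2, 1],
     [1, 3, 0, -1]]
c = [18, 12, 2, 6]


def simplex_multipliers(basis):
    """y^T = c_B^T B^{-1} for a basis given as two 0-based column indices."""
    (p, q), (r, s) = [[A[i][j] for j in basis] for i in range(2)]
    det = F(p * s - q * r)
    B_inv = [[s / det, -q / det], [-r / det, p / det]]
    c_B = [c[j] for j in basis]
    return [sum(c_B[i] * B_inv[i][k] for i in range(2)) for k in range(2)]


def price_out(basis):
    """Relative costs r_j = c_j - y^T a_j (true cost minus synthetic cost)."""
    y = simplex_multipliers(basis)
    return [c[j] - sum(y[i] * A[i][j] for i in range(2)) for j in range(len(c))]


r_opt = price_out([0, 1])    # basis {a1, a2}: all relative costs nonnegative
r_other = price_out([1, 3])  # basis {a2, a4}: column a1 prices out negative
```

With the basis $\{a_1, a_2\}$ every column costs at least as much at its own price as when synthesized from the basis, so $y^T A \le c^T$ holds. With $\{a_2, a_4\}$ the first column has relative cost -6, which signals that the basis is not optimal.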


4.4  Sensitivity  and  Complementary  Slackness 

The  optimal  values  of  the  dual  variables  in  a  linear  program  can,  as  we  have  seen,  be 
interpreted  as  prices.  In  this  section  this  interpretation  is  explored  in  further  detail. 
Sensitivity

Suppose in the linear program

minimize   $c^T x$
subject to $Ax = b$, $x \ge 0$,

the optimal basis is B with corresponding solution $(x_B, 0)$, where $x_B = B^{-1} b$. A
solution to the corresponding dual is $y^T = c_B^T B^{-1}$.

Now, assuming nondegeneracy, small changes in the vector b will not cause the
optimal basis to change. Thus for $b + \Delta b$ the optimal solution is

$x = (x_B + \Delta x_B, 0)$,

where $\Delta x_B = B^{-1} \Delta b$. Thus the corresponding increment in the cost function is

$\Delta z = c_B^T \Delta x_B = y^T \Delta b$.                  (4.8)

This equation shows that y gives the sensitivity of the optimal cost with respect to
small changes in the vector b. In other words, if a new program were solved with b
changed to $b + \Delta b$, the change in the optimal value of the objective function would
be $y^T \Delta b$.
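
Equation (4.8) can be verified numerically. The sketch below is illustrative: it reuses the optimal basis $\{a_1, a_2\}$ of the four-column program from Sect. 4.3, and the perturbation `db` is a hypothetical choice, small enough that the basic solution stays nonnegative so the same basis remains optimal.

```python
from fractions import Fraction as F

B = [[3, 1], [1, 3]]        # columns a1, a2 of the Sect. 4.3 program
c_B = [18, 12]
b = [F(2), F(2)]
db = [F(1, 10), F(-1, 20)]  # hypothetical small change in b

det = F(B[0][0] * B[1][1] - B[0][1] * B[1][0])
B_inv = [[B[1][1] / det, -B[0][1] / det], [-B[1][0] / det, B[0][0] / det]]


def solve(rhs):
    """Basic solution x_B = B^{-1} rhs."""
    return [sum(B_inv[i][k] * rhs[k] for k in range(2)) for i in range(2)]


def cost(x_B):
    return sum(cb * xb for cb, xb in zip(c_B, x_B))


y = [sum(c_B[i] * B_inv[i][k] for i in range(2)) for k in range(2)]  # c_B^T B^{-1}
x_new = solve([bi + dbi for bi, dbi in zip(b, db)])

delta_z = cost(x_new) - cost(solve(b))               # actual change in optimal cost
predicted = sum(yi * dbi for yi, dbi in zip(y, db))  # y^T (delta b)
```

The relative costs do not depend on b, so as long as `x_new` remains nonnegative the basis stays optimal and the two quantities agree exactly.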

This interpretation of the dual vector y is intimately related to its interpretation
as a vector of simplex multipliers. Since $\lambda_j$ is the price of the unit vector $e_j$ when
constructed from the basis B, it directly measures the change in cost due to a change
in the jth component of the vector b. Thus, $\lambda_j$ may equivalently be considered as
the marginal price of the component $b_j$, since if $b_j$ is changed to $b_j + \Delta b_j$ the value
of the optimal solution changes by $\lambda_j \Delta b_j$.

If the linear program is interpreted as a diet problem, for instance, then $\lambda_j$ is
the maximum price per unit that the dietitian would be willing to pay for a small
amount of the jth nutrient, because decreasing the amount of nutrient that must
be supplied by food will reduce the food bill by $\lambda_j$ dollars per unit. If, as another
example, the linear program is interpreted as the problem faced by a manufacturer
who must select levels $x_1, x_2, \ldots, x_n$ of n production activities in order to meet
certain required levels of output $b_1, b_2, \ldots, b_m$ while minimizing production costs,
the $\lambda_j$’s are the marginal prices of the outputs. They show directly how much the
production cost varies if a small change is made in the output levels.


Complementary  Slackness 

The  optimal  solutions  to  primal  and  dual  programs  satisfy  an  additional  relation 
that  has  an  economic  interpretation.  This  relation  can  be  stated  for  any  pair  of  dual 
linear  programs,  but  we  state  it  here  only  for  the  asymmetric  and  the  symmetric 
pairs defined in Sect. 4.1.

Theorem. (Complementary slackness, asymmetric form). Let $x$ and $y$ be feasible solutions
for the primal and dual programs, respectively, in the pair (4.2). A necessary and
sufficient condition that they both be optimal solutions is that¹ for all $i$

i) $x_i > 0 \Rightarrow y^T a_i = c_i$
ii) $x_i = 0 \Leftarrow y^T a_i < c_i$.

Proof. If the stated conditions hold, then clearly $(y^T A - c^T)x = 0$. Thus $y^T b = c^T x$,
and by the corollary to Lemma 1, Sect. 4.2, the two solutions are optimal.
Conversely, if the two solutions are optimal, it must hold, by the Duality Theorem,
that $y^T b = c^T x$ and hence that $(y^T A - c^T)x = 0$. Since each component of $x$ is
nonnegative and each component of $y^T A - c^T$ is nonpositive, the conditions (i) and
(ii) must hold. ∎

Theorem. (Complementary slackness, symmetric form). Let $x$ and $y$ be feasible solutions
for the primal and dual programs, respectively, in the pair (4.1). A necessary and sufficient
condition that they both be optimal solutions is that for all $i$ and $j$

i) $x_i > 0 \Rightarrow y^T a_i = c_i$

ii) $x_i = 0 \Leftarrow y^T a_i < c_i$

iii) $y_j > 0 \Rightarrow a^j x = b_j$

iv) $y_j = 0 \Leftarrow a^j x > b_j$,

(where $a^j$ is the $j$th row of $A$).

Proof. This follows by transforming the previous theorem. ∎

The complementary slackness conditions have a rather obvious economic interpretation.
Thinking in terms of the diet problem, for example, which is the primal
part of a symmetric pair of dual problems, suppose that the optimal diet supplies
more than $b_j$ units of the $j$th nutrient. This means that the dietitian would be unwilling
to pay anything for small quantities of that nutrient, since availability of it would
not reduce the cost of the optimal diet. This, in view of our previous interpretation
of $y_j$ as a marginal price, implies $y_j = 0$, which is (iv) of Theorem 4.4. The other
conditions have similar interpretations which the reader can work out.
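
As a small check of the symmetric form, the following sketch tests the complementary slackness conditions for a given feasible pair. The data `A`, `b`, `c` and the function name `complementary_slackness_holds` are hypothetical, chosen only for illustration. Given feasibility, conditions (i) and (ii) both amount to "$x_i = 0$ or $y^T a_i = c_i$", and (iii) and (iv) to "$y_j = 0$ or $a^j x = b_j$".

```python
# Hypothetical symmetric pair (illustrative integer data):
#   primal: minimize c^T x  subject to A x >= b, x >= 0
#   dual:   maximize y^T b  subject to y^T A <= c^T, y >= 0
A = [[1, 2, 3],
     [2, 2, 1]]
b = [5, 6]
c = [3, 4, 5]

def complementary_slackness_holds(x, y):
    """Return True if the feasible pair (x, y) satisfies conditions (i)-(iv)."""
    m, n = len(A), len(A[0])
    ya = [sum(y[j] * A[j][i] for j in range(m)) for i in range(n)]  # y^T a_i
    ax = [sum(A[j][i] * x[i] for i in range(n)) for j in range(m)]  # a^j x
    # Both vectors must be feasible for the test to be meaningful.
    assert all(v >= 0 for v in x) and all(ax[j] >= b[j] for j in range(m))
    assert all(v >= 0 for v in y) and all(ya[i] <= c[i] for i in range(n))
    columns_ok = all(x[i] == 0 or ya[i] == c[i] for i in range(n))  # (i), (ii)
    rows_ok = all(y[j] == 0 or ax[j] == b[j] for j in range(m))     # (iii), (iv)
    return columns_ok and rows_ok

def cost(x):
    return sum(ci * xi for ci, xi in zip(c, x))

y = [1, 1]
print(complementary_slackness_holds([1, 2, 0], y), cost([1, 2, 0]))  # optimal pair
print(complementary_slackness_holds([3, 1, 0], y), cost([3, 1, 0]))  # feasible, not optimal
```

In the second call the primal vector oversupplies the second requirement while $y_2 > 0$, so condition (iii) fails and the pair cannot be optimal.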


¹ The symbol $\Rightarrow$ means "implies" and $\Leftarrow$ means "is implied by."


4.5 Max Flow-Min Cut Theorem

One of the most exemplary pairs of linear primal and dual problems is the max-flow
and min-cut theorem, which we describe in this section. The maximal flow problem
described in Chap. 2 can be expressed more compactly in terms of the node-arc
incidence matrix (see Appendix D). Let $x$ be the vector of arc flows $x_{ij}$ (ordered in
any way). Let $A$ be the corresponding node-arc incidence matrix. Finally, let $e$ be a
vector with dimension equal to the number of nodes and having a $+1$ component on
node 1, a $-1$ on node $m$, and all other components zero. The maximal flow problem
is then


$$
\begin{array}{ll}
\text{maximize} & f \\
\text{subject to} & Ax - fe = 0 \\
& x \le k.
\end{array} \tag{4.9}
$$

The coefficient matrix of this problem is equal to the node-arc incidence matrix with
an additional column for the flow variable $f$. Any basis of this matrix is triangular,
and hence as indicated by the theory in the transportation problem in Chap. 3, the
simplex method can be effectively employed to solve this problem. However, instead
of the simplex method, a simple algorithm based on the tree algorithm (also see
Appendix D) can be used.


Max  Flow  Augmenting  Algorithm 

The  basic  strategy  of  the  algorithm  is  quite  simple.  First  we  recognize  that  it  is 
possible  to  send  nonzero  flow  from  node  1  to  node  m  only  if  node  m  is  reachable 
from  node  1.  The  tree  procedure  can  be  used  to  determine  if  m  is  in  fact  reachable; 
and  if  it  is  reachable,  the  algorithm  will  produce  a  path  from  1  to  m.  By  examining 
the  arcs  along  this  path,  we  can  determine  the  one  with  minimum  capacity.  We  may 
then  construct  a  flow  equal  to  this  capacity  from  1  to  m  by  using  this  path.  This  gives 
us a strictly positive (and integer-valued) initial flow.

Next consider the nature of the network at this point in terms of additional flows
that might be assigned. If there is already flow $x_{ij}$ in the arc $(i, j)$, then the effective
capacity of that arc is reduced by $x_{ij}$ (to $k_{ij} - x_{ij}$), since that is the maximal amount
of additional flow that can be assigned to that arc. On the other hand, the effective
reverse capacity, on the arc $(j, i)$, is increased by $x_{ij}$ (to $k_{ji} + x_{ij}$), since a small
incremental backward flow is actually realized as a reduction in the forward flow through
that arc. Once these changes in capacities have been made, the tree procedure can
again be used to find a path from node 1 to node $m$ on which to assign additional
flow. (Such a path is termed an augmenting path.) Finally, if $m$ is not reachable
from 1, no additional flow can be assigned, and the procedure is complete.

It is seen that the method outlined above is based on repeated application of
the tree procedure, which is implemented by labeling and scanning. By including
slightly more information in the labels than in the basic tree algorithm, the minimum
arc capacity of the augmenting path can be determined during the initial scanning,
instead of by reexamining the arcs after the path is found. A typical label at a node
$i$ has the form $(k, c_i)$, where $k$ denotes a precursor node and $c_i$ is the maximal flow
that can be sent from the source to node $i$ through the path created by the previous
labeling and scanning. The complete procedure is this:

Step 0. Set all $x_{ij} = 0$ and $f = 0$.

Step 1. Label node 1 $(-, \infty)$. All other nodes are unlabeled.

Step 2. Select any labeled node $i$ for scanning. Say it has label $(k, c_i)$. For all
unlabeled nodes $j$ such that $(i, j)$ is an arc with $x_{ij} < k_{ij}$, assign the label $(i, c_j)$,
where $c_j = \min\{c_i, k_{ij} - x_{ij}\}$. For all unlabeled nodes $j$ such that $(j, i)$ is an arc
with $x_{ji} > 0$, assign the label $(i, c_j)$, where $c_j = \min\{c_i, x_{ji}\}$.

Step 3. Repeat Step 2 until either node $m$ is labeled or until no more labels can
be assigned. In this latter case, the current solution is optimal.

Step 4. (Augmentation.) If the node $m$ is labeled $(i, c_m)$, then increase $f$ and
the flow on arc $(i, m)$ by $c_m$. Continue to work backward along the augmenting
path determined by the nodes, increasing the flow on each arc of the path by $c_m$
(for an arc traversed in the reverse direction, this means decreasing its forward
flow by $c_m$). Return to Step 1.

The validity of the algorithm, in particular its finite termination, should be fairly
apparent (when the capacities are integers, each augmentation increases the flow by at
least one unit). However, a complete proof is deferred until we consider
the max flow-min cut theorem below.
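
The steps above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `max_flow`, the dictionary representation of capacities, and the small test network are hypothetical. The rule "select any labeled node" is realized here by scanning in first-in, first-out order, and each label carries one extra item (the direction in which the arc is traversed) beyond the pair $(k, c_i)$ used in the text, so that Step 4 knows whether to increase or decrease an arc flow.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Labeling algorithm. capacity maps arcs (i, j) to k_ij.
    Returns (f, flow, labeled), where labeled is the final set of labeled nodes."""
    flow = {arc: 0 for arc in capacity}                  # Step 0
    nodes = {node for arc in capacity for node in arc}
    out_arcs = {node: [] for node in nodes}
    in_arcs = {node: [] for node in nodes}
    for (i, j) in capacity:
        out_arcs[i].append(j)
        in_arcs[j].append(i)
    f = 0
    while True:
        # Step 1: label the source; a label is (precursor, direction, c).
        labels = {source: (None, 0, float("inf"))}
        queue = deque([source])
        # Steps 2-3: scan labeled nodes until the sink is labeled or no labels remain.
        while queue and sink not in labels:
            i = queue.popleft()
            c_i = labels[i][2]
            for j in out_arcs[i]:                        # forward arcs (i, j)
                spare = capacity[(i, j)] - flow[(i, j)]
                if j not in labels and spare > 0:
                    labels[j] = (i, +1, min(c_i, spare))
                    queue.append(j)
            for j in in_arcs[i]:                         # reverse use of arcs (j, i)
                if j not in labels and flow[(j, i)] > 0:
                    labels[j] = (i, -1, min(c_i, flow[(j, i)]))
                    queue.append(j)
        if sink not in labels:
            return f, flow, set(labels)                  # no augmenting path: done
        # Step 4: augment by c_m along the path traced back from the sink.
        c_m = labels[sink][2]
        f += c_m
        j = sink
        while j != source:
            i, direction, _ = labels[j]
            if direction > 0:
                flow[(i, j)] += c_m
            else:
                flow[(j, i)] -= c_m
            j = i

# Hypothetical network: node 1 is the source, node 4 the sink.
capacity = {(1, 2): 3, (1, 3): 2, (2, 3): 1, (2, 4): 2, (3, 4): 3}
f, flow, labeled = max_flow(capacity, source=1, sink=4)
print(f, flow, labeled)
```

The set of nodes that are still labeled when the procedure stops is returned as well, since it is what the proof of the max flow-min cut theorem uses to construct a cut.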

Example.  An  example  of  the  above  procedure  is  shown  in  Fig.  4.4.  Node  1  is  the 
source,  and  node  6  is  the  sink.  The  original  network  with  capacities  indicated  on  the 
arcs  is  shown  in  Fig.  4.4a.  Also  shown  in  that  figure  are  the  initial  labels  obtained  by 
the  procedure.  In  this  case  the  sink  node  is  labeled,  indicating  that  a  flow  of  1  unit 
can  be  achieved.  The  augmenting  path  of  this  flow  is  shown  in  Fig.  4.4b.  Numbers 
in  square  boxes  indicate  the  total  flow  in  an  arc.  The  new  labels  are  then  found  and 
added  to  that  figure.  Note  that  node  2  cannot  be  labeled  from  node  1  because  there 
is  no  unused  capacity  in  that  direction.  Node  2  can,  however,  be  labeled  from  node 
4,  since  the  existing  flow  provides  a  reverse  capacity  of  1  unit.  Again  the  sink  is 
labeled,  and  1  unit  more  flow  can  be  constructed.  The  augmenting  path  is  shown  in 
Fig.  4.4c.  A  new  labeling  is  appended  to  that  figure.  Again  the  sink  is  labeled,  and 
an  additional  1  unit  of  flow  can  be  sent  from  source  to  sink.  The  path  of  this  1  unit  is 
shown  in  Fig.  4.4d.  Note  that  it  includes  a  flow  from  node  4  to  node  2,  even  though 
flow  was  not  allowed  in  this  direction  in  the  original  network.  This  flow  is  allowable 
now,  however,  because  there  is  already  flow  in  the  opposite  direction.  The  total  flow 
at  this  point  is  shown  in  Fig.  4.4e.  The  flow  levels  are  again  in  square  boxes.  This 
flow  is  maximal,  since  only  the  source  node  can  be  labeled. 


Max  Flow-Min  Cut  Theorem 

A great deal of insight and some further results can be obtained through the
introduction of the notion of cuts in a network. Given a network with source node
1 and sink node $m$, divide the nodes arbitrarily into two sets $S$ and $\bar{S}$ such that
the source node is in $S$ and the sink is in $\bar{S}$. The set of arcs from $S$ to $\bar{S}$ is a cut and
is denoted $(S, \bar{S})$. The capacity of the cut is the sum of the capacities of the arcs in
the cut.

An example of a cut is shown in Fig. 4.5. The set $S$ consists of nodes 1 and 2,
while $\bar{S}$ consists of 3, 4, 5, 6. The capacity of this cut is 4.

Fig. 4.4 Illustration of algorithmic steps of the maximal flow example

Fig. 4.5 A cut


It should be clear that a path from node 1 to node $m$ must include at least one arc
in any cut, for the path must have an arc from the set $S$ to the set $\bar{S}$. Furthermore, it
is clear that the maximal amount of flow that can be sent through a cut is equal to
its capacity. Thus each cut gives an upper bound on the value of the maximal flow
problem. The max flow-min cut theorem states that equality is actually achieved for
some cut. That is, the maximal flow is equal to the minimal cut capacity. It should
be noted that the proof of the theorem also establishes the maximality of the flow
obtained by the maximal flow algorithm.

Max Flow-Min Cut Theorem. In a network the maximal flow between a source and a sink
is equal to the minimal cut capacity of all cuts separating the source and sink.

Proof.  Since  any  cut  capacity  must  be  greater  than  or  equal  to  the  maximal  flow,  it  is 
only  necessary  to  exhibit  a  flow  and  a  cut  for  which  equality  is  achieved.  Begin  with 
a  flow  in  the  network  that  cannot  be  augmented  by  the  maximal  flow  algorithm.  For 
this  flow  find  the  effective  arc  capacities  of  all  arcs  for  incremental  flow  changes  as 
described  earlier  and  apply  the  labeling  procedure  of  the  maximal  flow  algorithm. 
Since  no  augmenting  path  exists,  the  algorithm  must  terminate  before  the  sink  is 
labeled. 

Let $S$ and $\bar{S}$ consist of all labeled and unlabeled nodes, respectively. This defines
a cut separating the source from the sink. All arcs originating in $S$ and terminating
in $\bar{S}$ have zero incremental capacity, or else a node in $\bar{S}$ could have been labeled.
This means that each arc in the cut is saturated by the original flow; that is, the
flow is equal to the capacity. Any arc originating in $\bar{S}$ and terminating in $S$, on the
other hand, must have zero flow; otherwise, this would imply a positive incremental
capacity in the reverse direction, and the originating node in $\bar{S}$ would be labeled.
Thus, there is a total flow from $S$ to $\bar{S}$ equal to the cut capacity, and zero flow from
$\bar{S}$ to $S$. This means that the flow from source to sink is equal to the cut capacity.
Thus the cut capacity must be minimal, and the flow must be maximal. ∎

In the network of Fig. 4.4, the minimal cut corresponds to the $S$ consisting only
of the source. That cut capacity is 3. Note that in accordance with the max flow-min
cut theorem, this is equal to the value of the maximal flow, and the minimal
cut is determined by the final labeling in Fig. 4.4e. In Fig. 4.5 the cut shown is also
minimal, and the reader should easily be able to determine the pattern of maximal
flow.
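
The construction used in the proof can be illustrated with a short sketch: given a flow that cannot be augmented, label every node reachable from the source through arcs with positive incremental capacity, take $S$ to be the labeled set, and add up the capacities of the arcs from $S$ to $\bar{S}$. The network, the flow, and the function name `cut_from_labeling` below are hypothetical and serve only as an example (they are not the network of Fig. 4.4).

```python
def cut_from_labeling(capacity, flow, source):
    """Return (S, capacity of the cut (S, S-bar)) for the labeling started at source."""
    labeled = {source}
    to_scan = [source]
    while to_scan:
        i = to_scan.pop()
        for (p, q), k in capacity.items():
            if p == i and q not in labeled and flow[(p, q)] < k:
                labeled.add(q)            # unused forward capacity on (i, q)
                to_scan.append(q)
            elif q == i and p not in labeled and flow[(p, q)] > 0:
                labeled.add(p)            # reverse capacity from existing flow on (p, i)
                to_scan.append(p)
    cut_capacity = sum(k for (p, q), k in capacity.items()
                       if p in labeled and q not in labeled)
    return labeled, cut_capacity

# Hypothetical network with source 1 and sink 4, and a flow that cannot be augmented.
capacity = {(1, 2): 4, (1, 3): 3, (2, 4): 2, (3, 4): 2}
flow = {(1, 2): 2, (1, 3): 2, (2, 4): 2, (3, 4): 2}

S, cut_capacity = cut_from_labeling(capacity, flow, source=1)
flow_value = flow[(1, 2)] + flow[(1, 3)]
print(S, cut_capacity, flow_value)
```

Here the sink is not labeled, every arc leaving $S$ is saturated, and the cut capacity coincides with the flow value, as the theorem asserts.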


Relation  to  Duality 


The character of the max flow-min cut theorem suggests a connection with the
Duality Theorem. We conclude this section by exploring this connection.

The maximal flow problem is a linear program, which is expressed formally
by (4.9). The dual problem is found to be

$$
\begin{array}{ll}
\text{minimize} & w^T k \\
\text{subject to} & u^T A = w^T \\
& u^T e = 1 \\
& w \ge 0.
\end{array} \tag{4.10}
$$

When written out in detail, the dual is

$$
\begin{array}{ll}
\text{minimize} & \sum_{ij} w_{ij} k_{ij} \\
\text{subject to} & u_i - u_j = w_{ij} \\
& u_1 - u_m = 1 \\
& w_{ij} \ge 0.
\end{array} \tag{4.11}
$$

A pair $i, j$ is included in the above only if $(i, j)$ is an arc of the network.

A feasible solution to this dual problem can be found in terms of any cut
set $(S, \bar{S})$. In particular, it is easily seen that

$$
u_i = \begin{cases} 1 & \text{if } i \in S \\ 0 & \text{if } i \in \bar{S} \end{cases}
\qquad
w_{ij} = \begin{cases} 1 & \text{if } (i, j) \in (S, \bar{S}) \\ 0 & \text{otherwise} \end{cases}
\tag{4.12}
$$

is a feasible solution. The value of the dual problem corresponding to this solution
is the cut capacity. If we take the cut set to be the one determined by the labeling
procedure of the maximal flow algorithm as described in the proof of the theorem
above, it can be seen to be optimal by verifying the complementary slackness
conditions (a task we leave to the reader). The minimum value of the dual is therefore
equal to the minimum cut capacity.


4.6 The Dual Simplex Method

Often there is available a basic solution to a linear program which is not feasible but
which prices out optimally; that is, the simplex multipliers are feasible for the dual
problem. In the simplex tableau this situation corresponds to having no negative
elements in the bottom row but an infeasible basic solution. Such a situation may arise,
for example, if a solution to a certain linear programming problem is calculated and
then a new problem is constructed by changing the vector $b$. In such situations a
basic feasible solution to the dual is available and hence it is desirable to pivot in
such a way as to optimize the dual.

Rather than constructing a tableau for the dual problem (which, if the primal is
in standard form, involves $m$ free variables and $n$ nonnegative slack variables), it is
more efficient to work on the dual from the primal tableau. The complete technique
based on this idea is the dual simplex method. In terms of the primal problem,
it operates by maintaining the optimality condition of the last row while working
toward feasibility. In terms of the dual problem, however, it maintains feasibility
while working toward optimality.

Given the linear program

$$
\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & Ax = b, \quad x \ge 0,
\end{array}
$$

suppose a basis $B$ is known such that $y$ defined by $y^T = c_B^T B^{-1}$ is feasible for the
dual. In this case we say that the corresponding basic solution to the primal,
$x_B = B^{-1} b$, is dual feasible. If $x_B \ge 0$ then this solution is also primal feasible and hence
optimal.

The given vector $y$ is feasible for the dual and thus satisfies $y^T a_j \le c_j$, for
$j = 1, 2, \ldots, n$. Indeed, assuming as usual that the basis is the first $m$ columns of $A$,
there is equality

$$y^T a_j = c_j, \quad \text{for } j = 1, 2, \ldots, m, \tag{4.14a}$$

and (barring degeneracy in the dual) there is inequality

$$y^T a_j < c_j, \quad \text{for } j = m + 1, \ldots, n. \tag{4.14b}$$


To  develop  one  cycle  of  the  dual  simplex  method,  we  find  a  new  vector  y  such  that 
one  of  the  equalities  becomes  an  inequality  and  one  of  the  inequalities  becomes 
equality,  while  at  the  same  time  increasing  the  value  of  the  dual  objective  function. 
The  m  equalities  in  the  new  solution  then  determine  a  new  basis. 

Denote the $i$th row of $B^{-1}$ by $u^i$. Then for

$$\bar{y}^T = y^T - \varepsilon u^i \tag{4.15}$$

we have $\bar{y}^T a_j = y^T a_j - \varepsilon u^i a_j$. Thus, recalling that $z_j = y^T a_j$ and noting that
$u^i a_j = y_{ij}$, the $ij$th element of the tableau, we have

$$\bar{y}^T a_j = c_j, \quad j = 1, 2, \ldots, m, \; j \ne i \tag{4.16a}$$

$$\bar{y}^T a_i = c_i - \varepsilon \tag{4.16b}$$

$$\bar{y}^T a_j = z_j - \varepsilon y_{ij}, \quad j = m + 1, m + 2, \ldots, n. \tag{4.16c}$$

Also,

$$\bar{y}^T b = y^T b - \varepsilon x_{Bi}. \tag{4.17}$$

These last equations lead directly to the algorithm:

Step 1. Given a dual feasible basic solution $x_B$, if $x_B \ge 0$ the solution is optimal.
If $x_B$ is not nonnegative, select an index $i$ such that the $i$th component of $x_B$,
$x_{Bi} < 0$.

Step 2. If all $y_{ij} \ge 0$, $j = 1, 2, \ldots, n$, then the dual has no maximum (this follows
since by (4.16) $\bar{y}$ is feasible for all $\varepsilon > 0$). If $y_{ij} < 0$ for some $j$, then let

$$\varepsilon_0 = \frac{z_k - c_k}{y_{ik}} = \min_j \left\{ \frac{z_j - c_j}{y_{ij}} : y_{ij} < 0 \right\}. \tag{4.18}$$

Step 3. Form a new basis $B$ by replacing $a_i$ by $a_k$. Using this basis determine the
corresponding basic dual feasible solution $x_B$ and return to Step 1.

The proof that the algorithm converges to the optimal solution is similar in its
details to the proof for the primal simplex procedure. The essential observations are:
(a) from the choice of $k$ in (4.18) and from (4.16a), (4.16b), (4.16c) the new solution
will again be dual feasible; (b) by (4.17) and the choice $x_{Bi} < 0$, the value of the dual
objective will increase; (c) the procedure cannot terminate at a nonoptimum point;
and (d) since there are only a finite number of bases, the optimum must be achieved
in a finite number of steps.

Example. A form of problem arising frequently is that of minimizing a positive
combination of positive variables subject to a series of "greater than" type inequalities
having positive coefficients. Such problems are natural candidates for application
of the dual simplex procedure. The classical diet problem is of this type as is
the simple example below.

$$
\begin{array}{ll}
\text{minimize} & 3x_1 + 4x_2 + 5x_3 \\
\text{subject to} & x_1 + 2x_2 + 3x_3 \ge 5 \\
& 2x_1 + 2x_2 + x_3 \ge 6 \\
& x_1 \ge 0, \; x_2 \ge 0, \; x_3 \ge 0.
\end{array}
$$

By introducing surplus variables and by changing the sign of the inequalities we
obtain the initial tableau (the pivot element is shown in parentheses)

     -1    -2    -3     1     0  |  -5
    (-2)   -2    -1     0     1  |  -6
      3     4     5     0     0  |   0

    Initial tableau

The basis corresponds to a dual feasible solution since all of the $c_j - z_j$'s are
nonnegative. We select any $x_{Bi} < 0$, say $x_5 = -6$, to remove from the set of basic
variables. To find the appropriate pivot element in the second row we compute the
ratios $(z_j - c_j)/y_{2j}$ and select the minimum positive ratio. This yields the pivot
indicated. Continuing, the remaining tableaus are

      0   (-1)  -5/2     1  -1/2  |  -2
      1     1    1/2     0  -1/2  |   3
      0     1    7/2     0   3/2  |  -9

    Second tableau

      0     1    5/2    -1   1/2  |   2
      1     0    -2      1   -1   |   1
      0     0     1      1    1   | -11

    Final tableau

The third tableau yields a feasible solution to the primal which must be optimal.
Thus the solution is $x_1 = 1$, $x_2 = 2$, $x_3 = 0$.
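
The three steps of the dual simplex method can be sketched on a tableau as follows. This is an illustrative sketch only: the function name `dual_simplex`, the list-of-rows tableau layout, and the rule of choosing the most negative basic variable in Step 1 (the method allows any negative one) are assumptions of the example. The bottom row holds the relative cost coefficients $c_j - z_j$, so the ratio $(z_j - c_j)/y_{ij}$ of (4.18) is computed as the bottom-row entry divided by $-y_{ij}$.

```python
from fractions import Fraction as F

def dual_simplex(tableau, basis):
    """tableau: m constraint rows followed by the row of relative costs (all >= 0);
    the last column is the right-hand side. basis[i] is the basic column of row i."""
    T = [[F(v) for v in row] for row in tableau]
    basis = list(basis)
    m, n = len(T) - 1, len(T[0]) - 1
    while True:
        # Step 1: look for a negative basic variable (here: the most negative).
        i = min(range(m), key=lambda r: T[r][n])
        if T[i][n] >= 0:
            return T, basis                      # primal feasible, hence optimal
        # Step 2: ratio test over the negative entries of row i.
        cols = [j for j in range(n) if T[i][j] < 0]
        if not cols:
            raise ValueError("dual is unbounded; the primal has no feasible solution")
        k = min(cols, key=lambda j: T[m][j] / -T[i][j])
        # Step 3: pivot on element (i, k) to form the new basis.
        pivot = T[i][k]
        T[i] = [v / pivot for v in T[i]]
        for r in range(m + 1):
            if r != i:
                factor = T[r][k]
                T[r] = [a - factor * p for a, p in zip(T[r], T[i])]
        basis[i] = k

# Initial tableau of the example above; the surplus variables (columns 3, 4) are basic.
tableau = [[-1, -2, -3, 1, 0, -5],
           [-2, -2, -1, 0, 1, -6],
           [3, 4, 5, 0, 0, 0]]
T, basis = dual_simplex(tableau, basis=[3, 4])

x = [F(0)] * 5
for row, j in enumerate(basis):
    x[j] = T[row][-1]
print(x[:3], -T[-1][-1])   # solution and objective value
```

Exact fractions are used so that the intermediate tableaus can be compared entry by entry with hand calculations.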


*4.7 The Primal-Dual Algorithm


In this section a procedure is described for solving linear programming problems by
working simultaneously on the primal and the dual problems. The procedure begins
with a feasible solution to the dual that is improved at each step by optimizing an
associated restricted primal problem. As the method progresses it can be regarded
as striving to achieve the complementary slackness conditions for optimality.
Originally, the primal-dual method was developed for solving a special kind of linear
program arising in network flow problems, and it continues to be the most efficient
procedure for these problems. (For general linear programs the dual simplex method
is most frequently used.) In this section we describe the generalized version of the
algorithm and point out an interesting economic interpretation of it. We consider the
program

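$$
\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & Ax = b, \quad x \ge 0
\end{array}
$$

and the corresponding dual program

$$
\begin{array}{ll}
\text{maximize} & y^T b \\
\text{subject to} & y^T A \le c^T.
\end{array} \tag{4.20}
$$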


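Given a feasible solution $y$ to the dual, define the subset $P$ of $1, 2, \ldots, n$ by
$i \in P$ if $y^T a_i = c_i$, where $a_i$ is the $i$th column of $A$. Thus, since $y$ is dual feasible, it
follows that $i \notin P$ implies $y^T a_i < c_i$. Now corresponding to $y$ and $P$, we define the
associated restricted primal problem

$$
\begin{array}{ll}
\text{minimize} & \mathbf{1}^T y \\
\text{subject to} & Ax + y = b \\
& x \ge 0, \quad x_i = 0 \text{ for } i \notin P \\
& y \ge 0,
\end{array} \tag{4.21}
$$

where $\mathbf{1}$ denotes the $m$-vector $(1, 1, \ldots, 1)$. (The vector $y$ appearing in (4.21) is the
vector of artificial variables of the restricted primal; it is distinct from the dual
vector $y$ used to define $P$.)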

The dual of this associated restricted primal is called the associated restricted
dual. It is

$$
\begin{array}{ll}
\text{maximize} & u^T b \\
\text{subject to} & u^T a_i \le 0, \quad i \in P \\
& u \le \mathbf{1}.
\end{array} \tag{4.22}
$$


The  condition  for  optimality  of  the  primal-dual  method  is  expressed  in  the  following 
theorem. 

Primal-Dual Optimality Theorem. Suppose that $y$ is feasible for the dual and that $x$,
with the artificial vector in (4.21) equal to zero, is feasible (and of course optimal) for
the associated restricted primal. Then $x$ and $y$ are optimal for the original primal and
dual programs, respectively.

Proof. Clearly $x$ is feasible for the primal. Also we have $c^T x = y^T A x$, because $y^T A$
is identical to $c^T$ on the components corresponding to nonzero elements of $x$. Thus
$c^T x = y^T A x = y^T b$ and optimality follows from Lemma 1, Sect. 4.2. ∎

The  primal-dual  method  starts  with  a  feasible  solution  to  the  dual  and  then 
optimizes  the  associated  restricted  primal.  If  the  optimal  solution  to  this  associated 
restricted  primal  is  not  feasible  for  the  primal,  the  feasible  solution  to  the  dual  is 
improved  and  a  new  associated  restricted  primal  is  determined.  Here  are  the  details: 

Step 1. Given a feasible solution $y_0$ to the dual program (4.20), determine the
associated restricted primal according to (4.21).

Step 2. Optimize the associated restricted primal. If the minimal value of this
problem is zero, the corresponding solution is optimal for the original primal
by the Primal-Dual Optimality Theorem.

Step 3. If the minimal value of the associated restricted primal is strictly positive,
obtain from the final simplex tableau of the restricted primal the solution
$u_0$ of the associated restricted dual (4.22). If there is no $j$ for which $u_0^T a_j > 0$,
conclude the primal has no feasible solutions. If, on the other hand, for at least
one $j$, $u_0^T a_j > 0$, define the new dual feasible vector

$$y = y_0 + \varepsilon_0 u_0$$

where

$$\varepsilon_0 = \frac{c_k - y_0^T a_k}{u_0^T a_k} = \min_j \left\{ \frac{c_j - y_0^T a_j}{u_0^T a_j} : u_0^T a_j > 0 \right\}.$$

Now go back to Step 1 using this $y$.
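
The update of the dual vector in Step 3 can be sketched as follows. This is an illustration under assumptions: the function name `dual_update` and the numerical data are hypothetical, and the restricted dual solution $u_0$ is taken as given (in the method it is read from the final tableau of the restricted primal). The function returns $\varepsilon_0$, the index $k$ attaining the minimum, and the new vector $y = y_0 + \varepsilon_0 u_0$, or `None` when $u_0^T a_j \le 0$ for all $j$.

```python
from fractions import Fraction as F

def dual_update(A, c, y0, u0):
    """One Step-3 update of the primal-dual method; see the formula for eps0 above."""
    m, n = len(A), len(A[0])
    y0 = [F(v) for v in y0]
    u0 = [F(v) for v in u0]
    ua = [sum(u0[i] * A[i][j] for i in range(m)) for j in range(n)]           # u0^T a_j
    slack = [c[j] - sum(y0[i] * A[i][j] for i in range(m)) for j in range(n)]  # c_j - y0^T a_j
    candidates = [j for j in range(n) if ua[j] > 0]
    if not candidates:
        return None                     # the primal has no feasible solution
    k = min(candidates, key=lambda j: slack[j] / ua[j])
    eps0 = slack[k] / ua[k]
    y = [y0[i] + eps0 * u0[i] for i in range(m)]
    return eps0, k, y

# Illustrative data: minimize c^T x subject to A x = b, x >= 0.
A = [[1, 1, 2],
     [2, 1, 3]]
c = [2, 1, 4]

# First update, starting from the dual feasible vector y0 = (0, 0) with u0 = (1, 1).
eps1, k1, y1 = dual_update(A, c, y0=[0, 0], u0=[1, 1])
# Second update, with an assumed restricted dual solution u0 = (-1, 1).
eps2, k2, y2 = dual_update(A, c, y0=y1, u0=[-1, 1])
print(eps1, k1, y1)
print(eps2, k2, y2)
```

After each update the column $a_k$ satisfies $y^T a_k = c_k$, so the index $k$ enters the set $P$ of the next restricted primal.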


To prove convergence of this method a few simple observations and explanations
must be made. First we verify the statement made in Step 3 that $u_0^T a_j \le 0$ for all $j$
implies that the primal has no feasible solution. The vector $y_\varepsilon = y_0 + \varepsilon u_0$ is feasible
for the dual problem for all positive $\varepsilon$, since $u_0^T A \le 0$. In addition, $y_\varepsilon^T b = y_0^T b + \varepsilon u_0^T b$
and, since $u_0^T b = \mathbf{1}^T y > 0$, we see that as $\varepsilon$ is increased we obtain an unbounded
solution to the dual. In view of the Duality Theorem, this implies that there is no
feasible solution to the primal.

Next suppose that in Step 3, for at least one $j$, $u_0^T a_j > 0$. Again we define the
family of vectors $y_\varepsilon = y_0 + \varepsilon u_0$. Since $u_0$ is a solution to (4.22) we have $u_0^T a_i \le 0$
for $i \in P$, and hence for small positive $\varepsilon$ the vector $y_\varepsilon$ is feasible for the dual. We
increase $\varepsilon$ to the first point where one of the inequalities $y_\varepsilon^T a_j < c_j$, $j \notin P$, becomes
an equality. This determines $\varepsilon_0 > 0$ and $k$. The new $y$ vector corresponds to an
increased value of the dual objective $y^T b = y_0^T b + \varepsilon_0 u_0^T b$. In addition, the corresponding
new set $P$ now includes the index $k$. Any other index $i$ that corresponded to a
positive value of $x_i$ in the associated restricted primal is in the new set $P$, because by
complementary slackness $u_0^T a_i = 0$ for such an $i$ and thus $y^T a_i = y_0^T a_i + \varepsilon_0 u_0^T a_i = c_i$.
This means that the old optimal solution is feasible for the new associated restricted
primal and that $a_k$ can be pivoted into the basis. Since $u_0^T a_k > 0$, pivoting in $a_k$ will
decrease the value of the associated restricted primal.

In summary, it has been shown that at each step either an improvement in the
associated primal is made or an infeasibility condition is detected. Assuming
nondegeneracy, this implies that no basis of the associated primal is repeated, and since
there are only a finite number of possible bases, the solution is reached in a finite
number of steps.

The primal-dual algorithm can be given an interesting interpretation in terms of
the manufacturing problem in Example 2, Sect. 2.2. Suppose we own a facility that
is capable of engaging in $n$ different production activities each of which produces
various amounts of $m$ commodities. Each activity $i$ can be operated at any level
$x_i \ge 0$, but when operated at the unity level the $i$th activity costs $c_i$ dollars and yields
the $m$ commodities in the amounts specified by the $m$-vector $a_i$. Assuming linearity
of the production facility, if we are given a vector $b$ describing output requirements
of the $m$ commodities, and we wish to produce these at minimum cost, ours is the
primal problem.

Imagine  that  an  entrepreneur  not  knowing  the  value  of  our  requirements  vector  b 
decides  to  sell  us  these  requirements  directly.  He  assigns  a  price  vector  yo  to  these 
requirements such that $y_0^T A \le c^T$. In this way his prices are competitive with our
production  activities,  and  he  can  assure  us  that  purchasing  directly  from  him  is  no 
more  costly  than  engaging  activities.  As  owner  of  the  production  facilities  we  are 
reluctant  to  abandon  our  production  enterprise  but,  on  the  other  hand,  we  deem  it  not 
frugal  to  engage  an  activity  whose  output  can  be  duplicated  by  direct  purchase  for 
lower  cost.  Therefore,  we  decide  to  engage  only  activities  that  cannot  be  duplicated 
cheaper,  and  at  the  same  time  we  attempt  to  minimize  the  total  business  volume 
given  the  entrepreneur.  Ours  is  the  associated  restricted  primal  problem. 

Upon receiving our order, the greedy entrepreneur decides to modify his prices
in such a manner as to keep them competitive with our activities but increase the
cost of our order. As a reasonable and simple approach he seeks new prices of the
form

$$y = y_0 + \varepsilon u_0,$$

where he selects $u_0$ as the solution to

$$
\begin{array}{ll}
\text{maximize} & u^T y \\
\text{subject to} & u^T a_i \le 0, \quad i \in P \\
& u \le \mathbf{1}.
\end{array}
$$

The first set of constraints is to maintain competitiveness of his new price vector for
small $\varepsilon$, while the second set is an arbitrary bound imposed to keep this subproblem
bounded. It is easily shown that the solution $u_0$ to this problem is identical to the
solution of the associated dual (4.22). After determining the maximum $\varepsilon$ to maintain
feasibility, he announces his new prices.

At  this  point,  rather  than  concede  to  the  price  adjustment,  we  recalculate  the  new 
minimum  volume  order  based  on  the  new  prices.  As  the  greedy  (and  shortsighted) 
entrepreneur  continues  to  change  his  prices  in  an  attempt  to  maximize  profit  he 
eventually finds he has reduced his business to zero! At that point we have, with his
help,  solved  the  original  primal  problem. 

Example. To illustrate the primal-dual method and indicate how it can be implemented
through use of the tableau format consider the following problem:

$$
\begin{array}{ll}
\text{minimize} & 2x_1 + x_2 + 4x_3 \\
\text{subject to} & x_1 + x_2 + 2x_3 = 3 \\
& 2x_1 + x_2 + 3x_3 = 5 \\
& x_1 \ge 0, \; x_2 \ge 0, \; x_3 \ge 0.
\end{array}
$$


Because all of the coefficients in the objective function are nonnegative, $y = (0, 0)$
is a feasible vector for the dual. We lay out the simplex tableau shown below

                       a1    a2    a3     ·     ·  |   b
                        1     1     2     1     0  |   3
                        2     1     3     0     1  |   5
                       -3    -2    -5     0     0  |  -8
    c_i - y^T a_i       2     1     4     ·     ·  |

    First tableau

To form this tableau we have adjoined artificial variables in the usual manner.
The third row gives the relative cost coefficients of the associated primal problem,
the same as the row that would be used in a phase I procedure. In the fourth row
are listed the $c_i - y^T a_i$'s for the current $y$. The allowable columns in the associated
restricted primal are determined by the zeros in this last row.

Since there are no zeros in the last row, no progress can be made in the associated
restricted primal and hence the original solution $x_1 = x_2 = x_3 = 0$, $y_1 = 3$, $y_2 = 5$ is
optimal for this $y$. The solution $u_0$ to the associated restricted dual is $u_0 = (1, 1)$, and
the numbers $-u_0^T a_i$, $i = 1, 2, 3$, are equal to the first three elements in the third row.
Thus, we compute the three ratios $\frac{2}{3}$, $\frac{1}{2}$, $\frac{4}{5}$, from which we find $\varepsilon_0 = \frac{1}{2}$. The new
values for the fourth row are now found by adding $\varepsilon_0$ times the (first three) elements
of the third row to the fourth row.


     a1    a2    a3     ·     ·  |   b
      1    (1)    2     1     0  |   3
      2     1     3     0     1  |   5
     -3    -2    -5     0     0  |  -8
    1/2     0    3/2    ·     ·  |   ·

    Second tableau


Minimizing the new associated restricted primal by pivoting as indicated we obtain

     a1    a2    a3     ·     ·  |   b
      1     1     2     1     0  |   3
      1     0     1    -1     1  |   2
     -1     0    -1     2     0  |  -2
    1/2     0    3/2    ·     ·  |   ·

Now we again calculate the ratios $\frac{1}{2}$, $\frac{3}{2}$, obtaining $\varepsilon_0 = \frac{1}{2}$, and add this multiple of
the third row to the fourth row to obtain the next tableau.


     a1     a2     a3     ·     ·      b
      1      1      2     1     0      3
    (1)      0      1    -1     1      2
     -1      0     -1     2     0     -2
      0      0      1     ·     ·      ·

        Third tableau (pivot element in parentheses)


Optimizing the new restricted primal, we obtain the tableau:

     a1     a2     a3     ·     ·      b
      0      1      1     2    -1      1
      1      0      1    -1     1      2
      0      0      0     1     1      0
      0      0      1     ·     ·      ·

        Final tableau

Having obtained feasibility in the primal, we conclude that the solution is also optimal: $x_1 = 2$, $x_2 = 1$, $x_3 = 0$.
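As a check on the example, the optimality of the final point can be confirmed directly from duality. The sketch below is an illustration under stated assumptions: the second restricted-dual solution `u1 = (-1, 1)` is not given in the text and is derived here from the third row $(-1, 0, -1)$ of the pivoted tableau, and the variable names are chosen only for this sketch.

```python
from fractions import Fraction as F

A = [[1, 1, 2], [2, 1, 3]]
b = [3, 5]
c = [2, 1, 4]
x = [2, 1, 0]                       # primal solution read from the final tableau

# Dual vector accumulated over the two dual updates: y = 0 + (1/2) u0 + (1/2) u1.
u0 = [F(1), F(1)]                   # stated in the text
u1 = [F(-1), F(1)]                  # assumption: solves -u^T a_i = (-1, 0, -1)
y = [F(1, 2) * (p + q) for p, q in zip(u0, u1)]

# Primal feasibility: Ax = b and x >= 0.
primal_feasible = (
    all(sum(A[i][j] * x[j] for j in range(3)) == b[i] for i in range(2))
    and min(x) >= 0
)
# Dual feasibility: c_j - y^T a_j >= 0 for every column j.
reduced = [c[j] - sum(y[i] * A[i][j] for i in range(2)) for j in range(3)]
dual_feasible = all(r >= 0 for r in reduced)

primal_value = sum(cj * xj for cj, xj in zip(c, x))
dual_value = sum(yi * bi for yi, bi in zip(y, b))
# Complementary slackness: x_j (c_j - y^T a_j) = 0 for every j.
complementary = all(r * xj == 0 for r, xj in zip(reduced, x))

print(primal_feasible, dual_feasible, complementary, primal_value, dual_value)
```

Both objective values equal 5, and the reduced costs $(0, 0, 1)$ agree with the last row of the final tableau, so the pair certifies optimality.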


4.8  Summary 

There is a corresponding dual linear program associated with every (primal) linear program. Both programs share the same underlying cost and constraint coefficients. We have established several theorems relating the pair. The variables of the dual problem can be interpreted as prices associated with the constraints of the original (primal) problem, and through this association it is possible to give an economically meaningful characterization to the dual whenever there is such a characterization for the primal.

Mathematically, each problem of the pair also provides an optimality certificate for the other: one cannot claim that a primal objective value is optimal unless one finds a feasible solution of the dual that achieves the same value of the dual objective. This also leads to the set of optimality conditions, including the complementarity conditions, that we will see many times in the rest of the book.


4.9  Exercises 

1. Verify in detail that the dual of the dual of a linear program is the original problem.

2.  Show  that  if  a  linear  inequality  in  a  linear  program  is  changed  to  equality,  the 
corresponding  dual  variable  becomes  free. 

3. Find the dual of

\[
\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & Ax = b,\ x \ge a,
\end{array}
\]

where $a \ge 0$.

4.  Show  that  in  the  transportation  problem  the  linear  equality  constraints  are  not 
linearly  independent,  and  that  in  an  optimal  solution  to  the  dual  problem  the 
dual  variables  are  not  unique.  Generalize  this  observation  to  any  linear  program 
having  redundant  equality  constraints. 

5.  Construct  an  example  of  a  primal  problem  that  has  no  feasible  solutions  and 
whose  corresponding  dual  also  has  no  feasible  solutions. 

6. Let $A$ be an $m \times n$ matrix and $c$ be an $n$-vector. Prove that $Ax \le 0$ implies $c^T x \le 0$ if and only if $c^T = y^T A$ for some $y \ge 0$. Give a geometric interpretation of the result.

7. There is in general a strong connection between the theories of optimization and free competition, which is illustrated by an idealized model of activity location. Suppose there are $n$ economic activities (various factories, homes, stores, etc.) that are to be individually located on $n$ distinct parcels of land. If activity $i$ is located on parcel $j$, that activity can yield $s_{ij}$ units (dollars) of value.

If the assignment of activities to land parcels is made by a central authority, it might be made in such a way as to maximize the total value generated. In other words, the assignment would be made so as to maximize $\sum_i \sum_j s_{ij} x_{ij}$, where

\[
x_{ij} = \begin{cases} 1 & \text{if activity } i \text{ is assigned to parcel } j \\ 0 & \text{otherwise.} \end{cases}
\]

More explicitly, this approach leads to the optimization problem

\[
\begin{array}{ll}
\text{maximize} & \sum_i \sum_j s_{ij} x_{ij} \\
\text{subject to} & \sum_j x_{ij} = 1, \quad i = 1, 2, \ldots, n \\
& \sum_i x_{ij} = 1, \quad j = 1, 2, \ldots, n \\
& x_{ij} \ge 0, \quad x_{ij} = 0 \text{ or } 1.
\end{array}
\]

Actually, it can be shown that the final requirement ($x_{ij} = 0$ or $1$) is automatically satisfied at any extreme point of the set defined by the other constraints, so that in fact the optimal assignment can be found by using the simplex method of linear programming.

If one considers the problem from the viewpoint of free competition, it is assumed that, rather than a central authority determining the assignment, the individual activities bid for the land and thereby establish prices.

(a) Show that there exists a set of activity prices $p_i$, $i = 1, 2, \ldots, n$, and land prices $q_j$, $j = 1, 2, \ldots, n$, such that

\[
p_i + q_j \ge s_{ij}, \quad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, n,
\]

with equality holding if in an optimal assignment activity $i$ is assigned to parcel $j$.

(b) Show that Part (a) implies that if activity $i$ is optimally assigned to parcel $j$ and if $j'$ is any other parcel, then

\[
s_{ij} - q_j \ge s_{ij'} - q_{j'}.
\]

Give an economic interpretation of this result and explain the relation between free competition and optimality in this context.

(c) Assuming that each $s_{ij}$ is positive, show that the prices can all be assumed to be nonnegative.


8. Construct the dual of the combinatorial auction problem of Example 7 of Chap. 2, and give an economic interpretation for each type of the dual variables.

9. Game theory is in part related to linear programming theory. Consider the game in which player X may select any one of $m$ moves, and player Y may select any one of $n$ moves. If X selects $i$ and Y selects $j$, then X wins an amount $a_{ij}$ from Y. The game is repeated many times. Player X develops a mixed strategy where the various moves are played according to probabilities represented by the components of the vector $\mathbf{x} = (x_1, x_2, \ldots, x_m)$, where $x_i \ge 0$, $i = 1, 2, \ldots, m$, and $\sum_{i=1}^{m} x_i = 1$. Likewise Y develops a mixed strategy $\mathbf{y} = (y_1, y_2, \ldots, y_n)$, where $y_j \ge 0$, $j = 1, 2, \ldots, n$, and $\sum_{j=1}^{n} y_j = 1$. The average payoff to X is then

\[
P(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T \mathbf{A} \mathbf{y}.
\]

(a) Suppose X selects $\mathbf{x}$ as the solution to the linear program

\[
\begin{array}{ll}
\text{maximize} & A \\
\text{subject to} & \sum_{i=1}^{m} x_i = 1 \\
& \sum_{i=1}^{m} x_i a_{ij} \ge A, \quad j = 1, 2, \ldots, n \\
& x_i \ge 0, \quad i = 1, 2, \ldots, m.
\end{array}
\]

Show that X is guaranteed a payoff of at least $A$ no matter what $\mathbf{y}$ is chosen by Y.

(b) Show that the dual of the problem above is

\[
\begin{array}{ll}
\text{minimize} & B \\
\text{subject to} & \sum_{j=1}^{n} y_j = 1 \\
& \sum_{j=1}^{n} a_{ij} y_j \le B, \quad i = 1, 2, \ldots, m \\
& y_j \ge 0, \quad j = 1, 2, \ldots, n.
\end{array}
\]

(c) Prove that $\max A = \min B$. (The common value is called the value of the game.)

(d) Consider the "matching" game. Each player selects heads or tails. If the choices match, X wins $1 from Y; if they do not match, Y wins $1 from X. Find the value of this game and the optimal mixed strategies.

(e) Repeat Part (d) for the game where each player selects either 1, 2, or 3. The player with the highest number wins $1 unless that number is exactly 1 higher than the other player's number, in which case he loses $3. When the numbers are equal there is no payoff.

10. Consider the primal linear program in the standard form. Suppose that this program and its dual are feasible. Let $y$ be a known optimal solution to the dual.

(a) If the $k$th equation of the primal is multiplied by $\mu \ne 0$, determine an optimal solution $w$ to the dual of this new problem.

(b) Suppose that, in the original primal, we add $\mu$ times the $k$th equation to the $r$th equation. What is an optimal solution $w$ to the corresponding dual problem?

(c) Suppose, in the original primal, we add $\mu$ times the $k$th row of $A$ to $c$. What is an optimal solution to the corresponding dual problem?

11. Consider the linear program (P) of the form

\[
\begin{array}{ll}
\text{minimize} & q^T z \\
\text{subject to} & Mz \ge -q,\ z \ge 0
\end{array}
\]

in which the matrix $M$ is skew symmetric; that is, $M = -M^T$.

(a) Show that problem (P) and its dual are the same.

(b) A problem of the kind in part (a) is said to be self-dual. An example of a self-dual problem has

\[
M = \begin{bmatrix} 0 & -A^T \\ A & 0 \end{bmatrix}, \quad
q = \begin{bmatrix} c \\ -b \end{bmatrix}, \quad
z = \begin{bmatrix} x \\ y \end{bmatrix}.
\]

Give an interpretation of the problem with this data.

(c) Show that a self-dual linear program has an optimal solution if and only if it is feasible.

12. A company may manufacture $n$ different products, each of which uses various amounts of $m$ limited resources. Each unit of product $i$ yields a profit of $c_i$ dollars and uses $a_{ji}$ units of the $j$th resource. The available amount of the $j$th resource is $b_j$. To maximize profit the company selects the quantities $x_i$ to be manufactured of each product by solving

\[
\begin{array}{ll}
\text{maximize} & c^T x \\
\text{subject to} & Ax = b,\ x \ge 0.
\end{array}
\]

The unit profits $c_i$ already take into account the variable cost associated with manufacturing each unit. In addition to that cost, the company incurs a fixed overhead $H$, and for accounting purposes it wants to allocate this overhead to each of its products. In other words, it wants to adjust the unit profits so as to account for the overhead. Such an overhead allocation scheme must satisfy two conditions: (4.1) Since $H$ is fixed regardless of the product mix, the overhead allocation scheme must not alter the optimal solution; (4.2) All the overhead must be allocated; that is, the optimal value of the objective with the modified cost coefficients must be $H$ dollars lower than $z_0$, the original optimal value of the objective.

(a) Consider the allocation scheme in which the unit profits are modified according to $\hat{c}^T = c^T - r y_0^T A$, where $y_0$ is the optimal solution to the original dual and $r = H/z_0$ (assume $H \le z_0$).

(i) Show that the optimal $x$ for the modified problem is the same as that for the original problem, and the new dual solution is $\hat{y}_0 = (1 - r) y_0$.

(ii) Show that this approach fully allocates $H$.

(b) Suppose that the overhead can be traced to each of the resource constraints. Let $H_i \ge 0$ be the amount of overhead associated with the $i$th resource, where $\sum_{i=1}^{m} H_i \le z_0$ and $r_i = H_i/b_i \le y_i^0$ (the $i$th component of $y_0$) for $i = 1, \ldots, m$. Based on this information, an allocation scheme has been proposed where the unit profits are modified such that $\hat{c}^T = c^T - r^T A$.

(i) Show that the optimal $x$ for this modified problem is the same as that for the original problem, and the corresponding dual solution is $\hat{y}_0 = y_0 - r$.

(ii) Show that this scheme fully allocates $H$.

13. Solve the linear inequalities

\[
\begin{array}{rcl}
-2x_1 + 2x_2 & \le & -1 \\
2x_1 - x_2 & \le & 2 \\
-4x_2 & \le & 3 \\
-15x_1 - 12x_2 & \le & -2 \\
12x_1 + 20x_2 & \le & -1.
\end{array}
\]

Note that $x_1$ and $x_2$ are not restricted to be positive. Solve this problem by considering the problem of maximizing $0 \cdot x_1 + 0 \cdot x_2$ subject to these constraints, taking the dual and using the simplex method.

14. (a) Using the simplex method solve

\[
\begin{array}{ll}
\text{minimize} & 2x_1 - x_2 \\
\text{subject to} & 2x_1 - x_2 - x_3 \ge 3 \\
& x_1 - x_2 + x_3 \ge 2 \\
& x_i \ge 0, \quad i = 1, 2, 3.
\end{array}
\]

(Hint: Note that $x_1 = 2$ gives a feasible solution.)

(b) What is the dual problem and its optimal solution?

15. (a) Using the simplex method solve

\[
\begin{array}{ll}
\text{minimize} & 2x_1 + 3x_2 + 2x_3 + 2x_4 \\
\text{subject to} & x_1 + 2x_2 + x_3 + 2x_4 = 3 \\
& x_1 + x_2 + 2x_3 + 4x_4 = 5 \\
& x_i \ge 0, \quad i = 1, 2, 3, 4.
\end{array}
\]

(b) Using the work done in Part (a) and the dual simplex method, solve the same problem but with the right-hand sides of the equations changed to 8 and 7 respectively.

16. For the problem

\[
\begin{array}{ll}
\text{minimize} & 5x_1 + 3x_2 \\
\text{subject to} & 2x_1 - x_2 + 4x_3 \le 4 \\
& x_1 + x_2 + 2x_3 \le 5 \\
& 2x_1 - x_2 + x_3 \ge 1 \\
& x_1 \ge 0,\ x_2 \ge 0,\ x_3 \ge 0;
\end{array}
\]

(a) Using a single pivot operation with pivot element 1, find a feasible solution.

(b) Using the simplex method, solve the problem.

(c) What is the dual problem?

(d) What is the solution to the dual?

17. Solve the following problem by the dual simplex method:

\[
\begin{array}{ll}
\text{minimize} & -7x_1 + 7x_2 - 2x_3 - x_4 - 6x_5 \\
\text{subject to} & 3x_1 - x_2 + x_3 - 2x_4 = -3 \\
& 2x_1 + x_2 + x_4 + x_5 = 4 \\
& -x_1 + 3x_2 - 3x_4 + x_6 = 12
\end{array}
\]

and $x_i \ge 0$, $i = 1, \ldots, 6$.

18. Given the linear programming problem in standard form (4.3), suppose a basis $B$ and the corresponding (not necessarily feasible) primal and dual basic solutions $\mathbf{x}$ and $\mathbf{y}$ are known. Assume that at least one relative cost coefficient $c_i - \mathbf{y}^T \mathbf{a}_i$ is negative. Consider the auxiliary problem

\[
\begin{array}{ll}
\text{minimize} & \mathbf{c}^T \mathbf{x} \\
\text{subject to} & \mathbf{A}\mathbf{x} = \mathbf{b} \\
& \sum_{i \in T} x_i + y = M \\
& \mathbf{x} \ge 0,\ y \ge 0,
\end{array}
\]

where $T = \{ i : c_i - \mathbf{y}^T \mathbf{a}_i < 0 \}$, $y$ is a slack variable, and $M$ is a large positive constant. Show that if $k$ is the index corresponding to the most negative relative cost coefficient in the original solution, then $(\mathbf{y}, c_k - \mathbf{y}^T \mathbf{a}_k)$ is dual feasible for the auxiliary problem. Based on this observation, develop a big-$M$ artificial constraint method for the dual simplex method. (Refer to Exercise 24, Chap. 3.)

19. A textile firm is capable of producing three products: $x_1$, $x_2$, $x_3$. Its production plan for next month must satisfy the constraints

\[
\begin{array}{l}
x_1 + 2x_2 + 2x_3 \le 12 \\
2x_1 + 4x_2 + x_3 \le f \\
x_1 \ge 0,\ x_2 \ge 0,\ x_3 \ge 0.
\end{array}
\]

The first constraint is determined by equipment availability and is fixed. The second constraint is determined by the availability of cotton. The net profits of the products are 2, 3, and 3, respectively, exclusive of the cost of cotton and fixed costs.

(a) Find the shadow price $\lambda_2$ of the cotton input as a function of $f$. (Hint: Use the dual simplex method.) Plot $\lambda_2(f)$ and the net profit $z(f)$ exclusive of the cost for cotton.

(b) The firm may purchase cotton on the open market at a price of 1/6. However, it may acquire a limited amount $s$ at a price of 1/12 from a major supplier that it purchases from frequently. Determine the net profit of the firm $\pi(s)$ as a function of $s$.

20. A certain telephone company would like to determine the maximum number of long-distance calls from Westburgh to Eastville that it can handle at any one time. The company has cables linking these cities via several intermediary cities as follows:

[Figure: network of cables from Westburgh to Eastville through intermediary cities, including Northgate and Northbay, with arc capacities and directions; not reproduced here.]

Each cable can handle a maximum number of calls simultaneously as indicated in the figure. For example, the number of calls routed from Westburgh to Northgate cannot exceed five at any one time. A call from Westburgh to Eastville can be routed through any other city, as long as there is a cable available that is not currently being used to its capacity. In addition to determining the maximum number of calls from Westburgh to Eastville, the company would, of course, like to know the optimal routing of these calls. Assume calls can be routed only in the directions indicated by the arrows.

(a) Formulate the above problem as a linear programming problem with upper bounds.

(Hint: Denote by $x_{ij}$ the number of calls routed from city $i$ to city $j$.)

(b) Find the solution by inspection of the graph.

21. Apply the maximal flow algorithm to the network below. All arcs have capacity 1 unless otherwise indicated.


22. Consider the problem

\[
\begin{array}{ll}
\text{minimize} & 2x_1 + x_2 + 4x_3 \\
\text{subject to} & x_1 + x_2 + 2x_3 = 3 \\
& 2x_1 + x_2 + 3x_3 = 5 \\
& x_1 \ge 0,\ x_2 \ge 0,\ x_3 \ge 0.
\end{array}
\]

(a) What is the dual problem?

(b) Note that $y = (1, 0)$ is feasible for the dual. Starting with this $y$, solve the primal using the primal-dual algorithm.

23. Show that in the associated restricted dual of the primal-dual method the objective $y^T b$ can be replaced by $y^T y$.

24. Consider the primal feasible region in standard form $Ax = b$, $x \ge 0$, where $A$ is an $m \times n$ matrix, $b$ is a constant nonzero $m$-vector, and $x$ is a variable $n$-vector.

(a) A variable $x_i$ is said to be a null variable if $x_i = 0$ in every feasible solution. Prove that, if the feasible region is non-empty, $x_i$ is a null variable if and only if there is a nonzero vector $y \in E^m$ such that $y^T A \ge 0$, $y^T b = 0$, and the $i$th component of $y^T A$ is strictly positive.

(b) Strict complementarity. Let the feasible region be nonempty. Then there is a feasible $x$ and vector $y \in E^m$ such that

\[
y^T A \ge 0, \quad y^T b = 0, \quad y^T A + x^T > 0.
\]

(c) A variable $x_i$ is a nonextremal variable if $x_i > 0$ in every feasible solution. Prove that, if the feasible region is non-empty, $x_i$ is a nonextremal variable if and only if there is $y \in E^m$ and $d \in E^n$ such that $y^T A = d^T$, where $d_i = -1$, $d_j \ge 0$ for $j \ne i$; and such that $y^T b < 0$.

4.1-4.4 Again most of the material in this chapter is now quite standard. See the

Ford and Fulkerson [F13]. For discussion of even more efficient versions of the maximal flow algorithm, see Lawler [L2] and Papadimitriou and Steiglitz [P2]. The Hungarian method for the assignment problem was designed by Kuhn [K10]. It is called the Hungarian method because it was based on work by the Hungarian mathematicians Egerváry and König. Ultimately, this led to the general primal-dual algorithm for linear programming.

4.6 The dual simplex method is due to Lemke [L4].

4.7 The general primal-dual algorithm is due to Dantzig, Ford and Fulkerson [D7]. See also Ford and Fulkerson [F13]. The economic interpretation given in this section is apparently novel.

The concepts of reduction are due to Shefi [S5], who has developed a complete theory in this area. For more details along the lines presented here, see Luenberger [L15].


Chapter  5 

Interior-Point  Methods 


Linear programs can be viewed in two somewhat complementary ways. They are, in one view, a class of continuous optimization problems each with continuous variables defined on a convex feasible region and with a continuous objective function. They are, therefore, a special case of the general form of problem considered in this text. However, linearity implies a certain degree of degeneracy, since for example the derivatives of all functions are constants and hence the differential methods of general optimization theory cannot be directly used. From an alternative view, linear programs can be considered as a class of combinatorial problems because it is known that solutions can be found by restricting attention to the vertices of the convex polyhedron defined by the constraints. Indeed, this view is natural when considering network problems such as those of early chapters. However, the number of vertices may be large, up to $n!/(m!(n-m)!)$, making direct search impossible for even modest size problems.
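To get a feeling for how quickly the bound $n!/(m!(n-m)!)$ grows, it can be evaluated for a few sizes. The sketch below is illustrative only; the helper name `max_vertices` and the sample sizes are assumptions chosen for this example.

```python
from math import comb

def max_vertices(m, n):
    """Upper bound n!/(m!(n-m)!) on the number of basic solutions of
    Ax = b, x >= 0, where A has m rows and n columns."""
    return comb(n, m)

# A few sample problem sizes (m equations, n variables).
for m, n in [(3, 6), (10, 20), (50, 100)]:
    print(m, n, max_vertices(m, n))
```

Already for $m = 50$, $n = 100$ the bound exceeds $10^{29}$, which is why enumerating vertices directly is not a practical method.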

The  simplex  method  embodies  both  of  these  viewpoints,  for  it  restricts  attention 
to  vertices,  but  exploits  the  continuous  nature  of  the  variables  to  govern  the  progress 
from  one  vertex  to  another,  defining  a  sequence  of  adjacent  vertices  with  improving 
values  of  the  objective  as  the  process  reaches  an  optimal  point.  The  simplex  method, 
with  ever-evolving  improvements,  has  for  five  decades  provided  an  efficient  general 
method  for  solving  linear  programs. 

Although it performs well in practice, visiting only a small fraction of the total number of vertices, a definitive theory of the simplex method's performance was unavailable. However, in 1972, Klee and Minty showed by examples that for certain linear programs the simplex method will examine every vertex.¹ These examples proved that in the worst case, the simplex method requires a number of steps that is exponential in the size of the problem.

In view of this result, many researchers believed that a good algorithm, different from the simplex method, might be devised whose number of steps would be polynomial rather than exponential in the program's size; that is, the time required to compute the solution would be bounded above by a polynomial in the size of the problem.

Indeed, in 1979, a new approach to linear programming, Khachiyan's ellipsoid method, was announced with great acclaim. The method is quite different in structure from the simplex method, for it constructs a sequence of shrinking ellipsoids, each of which contains the optimal solution set and each of which is smaller in volume than its predecessor by at least a certain fixed factor. Therefore, the solution set can be found to any desired degree of approximation by continuing the process. Khachiyan proved that the ellipsoid method, developed during the 1970s by other mathematicians, is a polynomial-time algorithm for linear programming.

Practical  experience,  however,  was  disappointing.  In  almost  all  cases,  the  simplex 
method  was  much  faster  than  the  ellipsoid  method.  However,  Khachiyan’s  ellipsoid 
method  showed  that  polynomial  time  algorithms  for  linear  programming  do  exist. 
It  left  open  the  question  of  whether  one  could  be  found  that,  in  practice,  was  faster 
than  the  simplex  method. 

It is then perhaps not surprising that the announcement by Karmarkar in 1984 of a new polynomial-time algorithm, an interior-point method, with the potential to improve on the practical effectiveness of the simplex method made front-page news in major newspapers and magazines throughout the world. It is this interior-point approach that is the subject of this chapter and the next.

This  chapter  begins  with  a  brief  introduction  to  complexity  theory,  which  is  the 
basis  for  a  way  to  quantify  the  performance  of  iterative  algorithms,  distinguishing 
polynomial-time  algorithms  from  others. 

Next, the example of Klee and Minty showing that the simplex method is not a polynomial-time algorithm in the worst case is presented. Following that, the ellipsoid algorithm is defined and shown to be a polynomial-time algorithm. These two sections provide a deeper understanding of how the modern theory of linear programming evolved, and help make clear how complexity theory impacts linear programming. However, the reader may wish to consider them optional and omit them at first reading.

The development of the basics of interior-point theory begins with Sect. 5.4, which introduces the concept of barrier functions and the analytic center. Section 5.5 introduces the central path, which underlies interior-point algorithms. The relations between primal and dual in this context are examined. An overview of the details of specific interior-point algorithms based on the theory is presented in Sects. 5.6 and 5.7.


1  We  will  be  more  precise  about  complexity  notions  such  as  “polynomial  algorithm”  in  Sect.  5.1 
below. 


5.1 Elements of Complexity Theory

Complexity  theory  is  arguably  the  foundation  for  analysis  of  computer  algorithms. 
The  goal  of  the  theory  is  twofold:  to  develop  criteria  for  measuring  the  effectiveness 
of  various  algorithms  (and  thus,  be  able  to  compare  algorithms  using  these  criteria), 
and  to  assess  the  inherent  difficulty  of  various  problems. 

The term complexity refers to the amount of resources required by a computation. In this chapter we focus on a particular resource, namely, computing time. In complexity theory, however, one is not interested in the execution time of a program implemented in a particular programming language, running on a particular computer over a particular input. This involves too many contingent factors. Instead, one wishes to associate to an algorithm more intrinsic measures of its time requirements.

Roughly  speaking,  to  do  so  one  needs  to  define: 

•  a  notion  of  input  size , 

•  a  set  of  basic  operations ,  and 

•  a  cost  for  each  basic  operation. 

The last two allow one to associate a cost with a computation. If $x$ is any input, the cost $C(x)$ of the computation with input $x$ is the sum of the costs of all the basic operations performed during this computation.

Let $\mathcal{A}$ be an algorithm and $\mathcal{J}_n$ be the set of all its inputs having size $n$. The worst-case cost function of $\mathcal{A}$ is the function $T^w_{\mathcal{A}}$ defined by

\[
T^w_{\mathcal{A}}(n) = \sup_{x \in \mathcal{J}_n} C(x).
\]

If there is a probability structure on $\mathcal{J}_n$, it is possible to define the average-case cost function $T^a_{\mathcal{A}}$ given by

\[
T^a_{\mathcal{A}}(n) = E_n(C(x)),
\]

where $E_n$ is the expectation over $\mathcal{J}_n$. However, the average is usually more difficult to find, and there is of course the issue of what probabilities to assign.
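The two cost functions can be made concrete on a toy algorithm. The sketch below is purely illustrative and not from the text: it assumes an algorithm that scans a 0/1 tuple from left to right and stops at the first 1, takes each comparison as a basic operation of cost 1, and assumes the uniform distribution over all inputs of size $n$.

```python
from fractions import Fraction
from itertools import product

def cost(x):
    """Number of comparisons used by a left-to-right scan that stops at the first 1."""
    for k, bit in enumerate(x, start=1):
        if bit == 1:
            return k
    return len(x)

def worst_and_average(n):
    """Worst-case and (uniform) average-case cost over all 0/1 inputs of size n."""
    inputs = list(product((0, 1), repeat=n))
    costs = [cost(x) for x in inputs]
    return max(costs), Fraction(sum(costs), len(costs))

print(worst_and_average(3))
```

For $n = 3$ the worst case is 3 comparisons while the average is 7/4, which shows how far apart the two measures can be; a different choice of probabilities would give a different average.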

We now discuss how the objects in the three items above are selected. The selection of a set of basic operations is generally easy. For the algorithms we consider in this chapter, the obvious choice is the set $\{+, -, \times, /, \le\}$ of the four arithmetic operations and the comparison. Selecting a notion of input size and a cost for the basic operations depends on the kind of data dealt with by the algorithm. Some kinds can be represented within a fixed amount of computer memory; others require a variable amount.

Examples of the first are fixed-precision floating-point numbers, stored in a fixed amount of memory (usually 32 or 64 bits). For this kind of data the size of an element is usually taken to be 1, so that each number has unit size.

Examples of the second are integer numbers, which require a number of bits approximately equal to the logarithm of their absolute value. This (base 2) logarithm is usually referred to as the bit size of the integer. Similar ideas apply for rational numbers.

Let $A$ be some kind of data and $x = (x_1, \ldots, x_n) \in A^n$. If $A$ is of the first kind above then we define $\text{size}(x) = n$. Otherwise, we define $\text{size}(x) = \sum_{i=1}^{n} \text{bit-size}(x_i)$.

The cost of operating on two unit-size numbers is taken to be 1 and is called the unit cost. In the bit-size case, the cost of operating on two numbers is the product of their bit-sizes (for multiplications and divisions) or their maximum (for additions, subtractions, and comparisons).
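The two size and cost conventions can be sketched for integer data as follows. This is a minimal illustration under assumptions: the function names are hypothetical, and the bit size of an integer is taken to be Python's `bit_length()` of its absolute value (with a minimum of 1 so that 0 has size 1), which only approximates the base-2 logarithm mentioned above.

```python
def bit_size(k):
    """Approximate bit size of an integer (assumed convention: at least 1)."""
    return max(1, abs(k).bit_length())

def size_unit(x):
    """Input size when every number has unit size."""
    return len(x)

def size_bits(x):
    """Input size as the sum of the bit sizes of the entries."""
    return sum(bit_size(k) for k in x)

def bit_cost(op, a, b):
    """Bit cost of one basic operation on integers a and b."""
    sa, sb = bit_size(a), bit_size(b)
    if op in ("*", "/"):
        return sa * sb          # product of bit sizes
    return max(sa, sb)          # additions, subtractions, comparisons

x = (5, 1000, -3)
print(size_unit(x), size_bits(x), bit_cost("*", 5, 1000), bit_cost("+", 5, 1000))
```

Under the unit model the tuple has size 3 and every operation costs 1, whereas under the bit model the same tuple has size 15 and multiplying 5 by 1000 costs 30.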

The consideration of integer or rational data with their associated bit size and bit cost for the arithmetic operations is usually referred to as the Turing model of computation. The consideration of idealized reals with unit size and unit cost is today referred to as the real number arithmetic model. When comparing algorithms, one should make clear which model of computation is used to derive complexity bounds.

A basic concept related to both models of computation is that of polynomial time. An algorithm $\mathcal{A}$ is said to be a polynomial time algorithm if $T^w_{\mathcal{A}}(n)$ is bounded above by a polynomial. A problem can be solved in polynomial time if there is a polynomial time algorithm solving the problem. The notion of average polynomial time is defined similarly, replacing $T^w_{\mathcal{A}}$ by $T^a_{\mathcal{A}}$.

The  notion  of  polynomial  time  is  usually  taken  as  the  formalization  of  efficiency 
in  complexity  theory. 


*5.2 The Simplex Method Is Not Polynomial-Time

When the simplex method is used to solve a linear program in standard form with coefficient matrix $A \in E^{m \times n}$, $b \in E^m$ and $c \in E^n$, the number of pivot steps to solve the problem starting from a basic feasible solution is typically a small multiple of $m$: usually between $2m$ and $3m$. In fact, Dantzig observed that for problems with $m < 50$ and $n < 200$ the number of iterations is ordinarily less than $1.5m$.

At  one  time  researchers  believed — and  attempted  to  prove — that  the  simplex 
algorithm  (or  some  variant  thereof)  always  requires  a  number  of  iterations  that  is 
bounded  by  a  polynomial  expression  in  the  problem  size.  That  was  until  Victor  Klee 
and  George  Minty  exhibited  a  class  of  linear  programs  each  of  which  requires  an 
exponential  number  of  iterations  when  solved  by  the  conventional  simplex  method. 

One form of the Klee-Minty example is

\[
\begin{array}{ll}
\text{maximize} & \sum_{j=1}^{n} 10^{n-j} x_j \\
\text{subject to} & 2 \sum_{j=1}^{i-1} 10^{i-j} x_j + x_i \le 100^{i-1}, \quad i = 1, \ldots, n \\
& x_j \ge 0, \quad j = 1, \ldots, n.
\end{array}
\qquad (5.1)
\]

The problem above is easily cast as a linear program in standard form.

A specific case is that for $n = 3$, giving

\[
\begin{array}{ll}
\text{maximize} & 100x_1 + 10x_2 + x_3 \\
\text{subject to} & x_1 \le 1 \\
& 20x_1 + x_2 \le 100 \\
& 200x_1 + 20x_2 + x_3 \le 10{,}000 \\
& x_1 \ge 0,\ x_2 \ge 0,\ x_3 \ge 0.
\end{array}
\]


In this case, we have three constraints and three variables (along with their
nonnegativity constraints). After adding slack variables, the problem is in standard
form. The system has $m = 3$ equations and $n = 6$ nonnegative variables. It can be
verified that it takes $2^3 - 1 = 7$ pivot steps to solve the problem with the simplex
method when at each step the pivot column is chosen to be the one with the largest
(because this is a maximization problem) reduced cost. (See Exercise 1.)
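The pivot count quoted above can be checked numerically. The following is a minimal sketch, not the book's implementation: it assumes a dense tableau simplex in exact rational arithmetic, with the slack variables as the starting basis, the largest-reduced-cost entering rule, and the minimum-ratio leaving rule. The function names `klee_minty` and `simplex_pivots` are hypothetical.

```python
from fractions import Fraction


def klee_minty(n):
    """Build (A, b, c) for problem (5.1): maximize c^T x s.t. A x <= b, x >= 0."""
    A = [[Fraction(0)] * n for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(1, i):
            A[i - 1][j - 1] = Fraction(2 * 10 ** (i - j))
        A[i - 1][i - 1] = Fraction(1)
    b = [Fraction(100 ** (i - 1)) for i in range(1, n + 1)]
    c = [Fraction(10 ** (n - j)) for j in range(1, n + 1)]
    return A, b, c


def simplex_pivots(A, b, c):
    """Tableau simplex for max c^T x, A x <= b, x >= 0, assuming b >= 0.

    Entering column: largest positive reduced cost. Leaving row: minimum ratio.
    Returns (number of pivots, optimal objective value).
    """
    m, n = len(A), len(c)
    # Each row holds [A | I | b]; the slack variables form the starting basis.
    T = [A[i] + [Fraction(int(i == k)) for k in range(m)] + [b[i]]
         for i in range(m)]
    r = list(c) + [Fraction(0)] * m  # reduced costs of all n + m variables
    z = Fraction(0)
    pivots = 0
    while True:
        q = max(range(n + m), key=lambda j: r[j])
        if r[q] <= 0:
            return pivots, z
        rows = [i for i in range(m) if T[i][q] > 0]
        if not rows:
            raise ValueError("problem is unbounded")
        p = min(rows, key=lambda i: T[i][-1] / T[i][q])
        piv = T[p][q]
        T[p] = [v / piv for v in T[p]]
        for i in range(m):
            if i != p and T[i][q] != 0:
                f = T[i][q]
                T[i] = [u - f * v for u, v in zip(T[i], T[p])]
        f = r[q]
        z += f * T[p][-1]  # objective gain from bringing column q into the basis
        r = [u - f * v for u, v in zip(r, T[p][:-1])]
        pivots += 1
```

Calling `simplex_pivots(*klee_minty(3))` reproduces the seven pivots of the $n = 3$ case and the optimal value 10,000. Exact fractions are used only to keep the pivot sequence free of rounding effects.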

The general problem of the class (5.1) takes $2^n - 1$ pivot steps and this is in fact
the number of vertices minus one (which is the starting vertex). To get an idea of
how bad this can be, consider the case where $n = 50$. We have $2^{50} - 1 \approx 10^{15}$. In
a year with 365 days, there are approximately $3 \times 10^7$ s. If a computer ran
continuously, performing a million pivots of the simplex algorithm per second, it would
take approximately

$$\frac{10^{15}}{3 \times 10^{7} \times 10^{6}} \approx 33 \text{ years}$$

to solve a problem of this class using the greedy pivot selection rule.

Although it is not polynomial in the worst case, the simplex method remains
one of the major solvers for linear programming. In fact, the method has recently
been proved to be (strongly) polynomial for solving the Markov Decision Process
with any fixed discount rate.


*5.3 The Ellipsoid Method

The  basic  ideas  of  the  ellipsoid  method  stem  from  research  done  in  the  1960s  and 
1970s  mainly  in  the  Soviet  Union  (as  it  was  then  called)  by  others  who  preceded 
Khachiyan.  In  essence,  the  idea  is  to  enclose  the  region  of  interest  in  ever  smaller 
ellipsoids. 

The significant contribution of Khachiyan was to demonstrate that under certain
assumptions, the ellipsoid method constitutes a polynomially bounded algorithm
for linear programming.

The version of the method discussed here is really aimed at finding a point of a
polyhedral set $\Omega$ given by a system of linear inequalities.

$$\Omega = \{\mathbf{y} \in E^m : \mathbf{y}^T \mathbf{a}_j \le c_j,\ j = 1, \ldots, n\}$$

Finding a point of $\Omega$ can be thought of as equivalent to solving a linear programming
problem.

Two  important  assumptions  are  made  regarding  this  problem: 

(A1) There is a vector $\mathbf{y}_0 \in E^m$ and a scalar $R > 0$ such that the closed ball
$S(\mathbf{y}_0, R)$ with center $\mathbf{y}_0$ and radius $R$, that is

$$\{\mathbf{y} \in E^m : |\mathbf{y} - \mathbf{y}_0| \le R\},$$

contains $\Omega$.

(A2) If $\Omega$ is nonempty, there is a known scalar $r > 0$ such that $\Omega$ contains a ball
of the form $S(\mathbf{y}^*, r)$ with center at $\mathbf{y}^*$ and radius $r$. (This assumption implies
that if $\Omega$ is nonempty, then it has a nonempty interior and its volume is at least
$\text{vol}(S(\mathbf{0}, r))$.)$^2$

Definition. An ellipsoid in $E^m$ is a set of the form

$$E = \{\mathbf{y} \in E^m : (\mathbf{y} - \mathbf{z})^T \mathbf{Q} (\mathbf{y} - \mathbf{z}) \le 1\}$$

where $\mathbf{z} \in E^m$ is a given point (called the center) and $\mathbf{Q}$ is a positive definite matrix
(see Sect. A.4 of Appendix A) of dimension $m \times m$. This ellipsoid is denoted $E(\mathbf{z}, \mathbf{Q})$.

The unit sphere $S(\mathbf{0}, 1)$ centered at the origin $\mathbf{0}$ is a special ellipsoid with $\mathbf{Q} = \mathbf{I}$, the
identity matrix.

The axes of a general ellipsoid are the eigenvectors of $\mathbf{Q}$ and the lengths of the
axes are $\lambda_1^{-1/2}, \lambda_2^{-1/2}, \ldots, \lambda_m^{-1/2}$, where the $\lambda_i$'s are the corresponding eigenvalues.
It can be shown that the volume of an ellipsoid is

$$\text{vol}(E) = \text{vol}(S(\mathbf{0}, 1)) \prod_{i=1}^{m} \lambda_i^{-1/2} = \text{vol}(S(\mathbf{0}, 1)) \det(\mathbf{Q}^{-1/2}).$$


Cutting  Plane  and  New  Containing  Ellipsoid 

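In the ellipsoid method, a series of ellipsoids $E_k$ is defined, with centers $\mathbf{y}_k$ and with
the defining $\mathbf{Q} = \mathbf{B}_k^{-1}$, where $\mathbf{B}_k$ is symmetric and positive definite.

At each iteration of the algorithm, we have $\Omega \subset E_k$. It is then possible to check
whether $\mathbf{y}_k \in \Omega$. If so, we have found an element of $\Omega$ as required. If not, there is at
least one constraint that is violated. Suppose $\mathbf{a}_j^T \mathbf{y}_k > c_j$. Then

$$\Omega \subset \tfrac{1}{2}E_k = \{\mathbf{y} \in E_k : \mathbf{a}_j^T \mathbf{y} \le \mathbf{a}_j^T \mathbf{y}_k\}.$$

This set is half of the ellipsoid, obtained by cutting the ellipsoid in half through its
center (Fig. 5.1).


2 The (topological) interior of any set $\Omega$ is the set of points in $\Omega$ which are the centers of some
balls contained in $\Omega$.

The successor ellipsoid $E_{k+1}$ is defined to be the minimal-volume ellipsoid
containing $(1/2)E_k$. It is constructed as follows. Define

$$\tau = \frac{1}{m+1}, \qquad \delta = \frac{m^2}{m^2 - 1}, \qquad \sigma = 2\tau.$$


Fig. 5.1 A half-ellipsoid


Then put

$$
\mathbf{y}_{k+1} = \mathbf{y}_k - \frac{\tau}{(\mathbf{a}_j^T \mathbf{B}_k \mathbf{a}_j)^{1/2}} \mathbf{B}_k \mathbf{a}_j, \qquad
\mathbf{B}_{k+1} = \delta \left( \mathbf{B}_k - \sigma \frac{\mathbf{B}_k \mathbf{a}_j \mathbf{a}_j^T \mathbf{B}_k}{\mathbf{a}_j^T \mathbf{B}_k \mathbf{a}_j} \right).
\tag{5.2}
$$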


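Theorem 1. The ellipsoid $E_{k+1} = E(\mathbf{y}_{k+1}, \mathbf{B}_{k+1}^{-1})$ defined as above is the ellipsoid of least
volume containing $(1/2)E_k$. Moreover,

$$\frac{\text{vol}(E_{k+1})}{\text{vol}(E_k)} = \left(\frac{m^2}{m^2 - 1}\right)^{(m-1)/2} \frac{m}{m+1} < \exp\left(-\frac{1}{2(m+1)}\right) < 1.$$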


Proof.  We  shall  not  prove  the  statement  about  the  new  ellipsoid  being  of  least 
volume,  since  that  is  not  necessary  for  the  results  that  follow.  To  prove  the  remainder 
of  the  statement,  we  have 


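$$\frac{\text{vol}(E_{k+1})}{\text{vol}(E_k)} = \frac{\det(\mathbf{B}_{k+1}^{1/2})}{\det(\mathbf{B}_k^{1/2})}.$$

For simplicity, by a change of coordinates, we may take $\mathbf{B}_k = \mathbf{I}$. Then $\mathbf{B}_{k+1}$ has $m - 1$
eigenvalues equal to $\delta = \frac{m^2}{m^2 - 1}$ and one eigenvalue equal to
$\delta - 2\delta\tau = \frac{m^2}{m^2 - 1}\left(1 - \frac{2}{m+1}\right) = \left(\frac{m}{m+1}\right)^2$.
The reduction in volume is the product of the square roots of these, giving
the equality in the theorem.

Then using $(1 + x)^p \le e^{xp}$, we have

$$
\left(\frac{m^2}{m^2 - 1}\right)^{(m-1)/2} \frac{m}{m+1}
= \left(1 + \frac{1}{m^2 - 1}\right)^{(m-1)/2} \left(1 - \frac{1}{m+1}\right)
< \exp\left(\frac{1}{2(m+1)} - \frac{1}{m+1}\right)
= \exp\left(-\frac{1}{2(m+1)}\right).
$$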
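As a quick numerical sanity check of the bound in Theorem 1, the sketch below compares the exact volume ratio with the exponential bound for a range of dimensions. The function names are hypothetical and the check is only illustrative; it does not replace the proof.

```python
import math


def volume_ratio(m):
    """Exact ratio vol(E_{k+1}) / vol(E_k) from Theorem 1 (needs m >= 2)."""
    return (m * m / (m * m - 1.0)) ** ((m - 1) / 2.0) * m / (m + 1.0)


def volume_bound(m):
    """Upper bound exp(-1 / (2(m + 1))) on the ratio, also from Theorem 1."""
    return math.exp(-1.0 / (2.0 * (m + 1)))
```

For $m = 2$ the ratio is about 0.77 against a bound of about 0.85; the two values move closer together as $m$ grows, which is why the per-step reduction becomes weak in high dimensions.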


Convergence 

The ellipsoid method is initiated by selecting $\mathbf{y}_0$ and $R$ such that condition (A1) is
satisfied. Then $\mathbf{B}_0 = R^2 \mathbf{I}$, and the corresponding $E_0$ contains $\Omega$. The updating of the
$E_k$'s is continued until a solution is found.

Under the assumptions stated above, a single repetition of the ellipsoid method
reduces the volume of an ellipsoid to one-half of its initial value in $O(m)$ iterations.
(See Appendix A for $O$ notation.) Hence it can reduce the volume to less than that
of a sphere of radius $r$ in $O(m^2 \log(R/r))$ iterations, since its volume is bounded
from below by $\text{vol}(S(\mathbf{0}, 1))\, r^m$ and the initial volume is $\text{vol}(S(\mathbf{0}, 1))\, R^m$. Generally
a single iteration requires $O(m^2)$ arithmetic operations. Hence the entire process
requires $O(m^4 \log(R/r))$ arithmetic operations.$^3$
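The iteration described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a robust solver: it uses floating-point arithmetic, takes the first violated constraint as the cut, requires $m \ge 2$ (so that $\delta$ is defined), and stops after a fixed iteration budget. The function name `ellipsoid_feasible_point` and its parameters are hypothetical.

```python
import math


def ellipsoid_feasible_point(a, c, y0, R, max_iter=10000):
    """Search for y with a[j] . y <= c[j] for all j, using update (5.2).

    a: list of n constraint vectors of length m >= 2; c: list of n scalars.
    y0, R: center and radius of a ball assumed to contain the feasible set.
    Returns a feasible y, or None if max_iter is reached.
    """
    m = len(y0)
    tau = 1.0 / (m + 1)
    delta = m * m / (m * m - 1.0)
    sigma = 2.0 * tau
    y = list(y0)
    # B_0 = R^2 I, so that E_0 is the ball S(y0, R).
    B = [[R * R if i == k else 0.0 for k in range(m)] for i in range(m)]
    for _ in range(max_iter):
        # Look for a violated constraint a_j^T y > c_j.
        viol = next((aj for aj, cj in zip(a, c)
                     if sum(u * v for u, v in zip(aj, y)) > cj), None)
        if viol is None:
            return y
        Ba = [sum(B[i][k] * viol[k] for k in range(m)) for i in range(m)]
        aBa = sum(viol[i] * Ba[i] for i in range(m))
        root = math.sqrt(aBa)
        # Center and matrix of the successor ellipsoid, as in (5.2).
        y = [y[i] - tau * Ba[i] / root for i in range(m)]
        B = [[delta * (B[i][k] - sigma * Ba[i] * Ba[k] / aBa) for k in range(m)]
             for i in range(m)]
    return None
```

For example, the triangle $y_1 \ge 1$, $y_2 \ge 1$, $y_1 + y_2 \le 3$ can be passed as `a = [(-1, 0), (0, -1), (1, 1)]`, `c = [-1, -1, 3]` with `y0 = (0, 0)` and `R = 10`.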


Ellipsoid  Method  for  Usual  Form  of  LP 


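Now consider the linear program (where $\mathbf{A}$ is $m \times n$)

$$
\begin{aligned}
\text{maximize}\quad & \mathbf{c}^T \mathbf{x} \\
\text{subject to}\quad & \mathbf{A}\mathbf{x} \le \mathbf{b},\ \mathbf{x} \ge \mathbf{0}
\end{aligned}
$$

and its dual

$$
\begin{aligned}
\text{minimize}\quad & \mathbf{y}^T \mathbf{b} \\
\text{subject to}\quad & \mathbf{y}^T \mathbf{A} \ge \mathbf{c}^T,\ \mathbf{y} \ge \mathbf{0}.
\end{aligned}
$$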


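Note that both problems can be solved by finding a feasible point to inequalities

$$
\begin{aligned}
-\mathbf{c}^T \mathbf{x} + \mathbf{b}^T \mathbf{y} &\le 0 \\
\mathbf{A}\mathbf{x} &\le \mathbf{b} \\
-\mathbf{A}^T \mathbf{y} &\le -\mathbf{c} \\
\mathbf{x},\ \mathbf{y} &\ge \mathbf{0},
\end{aligned}
\tag{5.3}
$$

where both $\mathbf{x}$ and $\mathbf{y}$ are variables. Thus, the total number of arithmetic operations
for solving a linear program is bounded by $O((m + n)^4 \log(R/r))$.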


3 Assumption (A2) is sometimes too strong. It has been shown, however, that when the data consists
of integers, it is possible to perturb the problem so that (A2) is satisfied and if the perturbed problem
has a feasible solution, so does the original $\Omega$.


5.4 The Analytic Center

The  new  interior-point  algorithms  introduced  by  Karmarkar  move  by  successive 
steps inside the feasible region. It is the interior of the feasible set rather than the
vertices and edges that plays a dominant role in this type of algorithm. In fact, these
algorithms  purposely  avoid  the  edges  of  the  set,  only  eventually  converging  to  one 
as  a  solution. 

Our  study  of  these  algorithms  begins  in  the  next  section,  but  it  is  useful  at  this 
point  to  introduce  a  concept  that  definitely  focuses  on  the  interior  of  a  set,  termed 
the  set’s  analytic  center.  As  the  name  implies,  the  center  is  away  from  the  edge. 

In  addition,  the  study  of  the  analytic  center  introduces  a  special  structure,  termed 
a  barrier  or  potential  that  is  fundamental  to  interior-point  methods. 

Consider a set $S$ in a subset $X$ of $E^n$ defined by a group of inequalities as

$$S = \{\mathbf{x} \in X : g_j(\mathbf{x}) \ge 0,\ j = 1, 2, \ldots, m\},$$

and assume that the functions $g_j$ are continuous. $S$ has a nonempty interior
$\mathring{S} = \{\mathbf{x} \in X : g_j(\mathbf{x}) > 0, \text{ all } j\}$. Associated with this definition of the set is the
potential function

$$\psi(\mathbf{x}) = -\sum_{j=1}^{m} \log g_j(\mathbf{x})$$

defined on $\mathring{S}$.

The analytic center of $S$ is the vector (or set of vectors) that minimizes the
potential; that is, the vector (or vectors) that solve

$$\min \psi(\mathbf{x}) = \min \left\{ -\sum_{j=1}^{m} \log g_j(\mathbf{x}) : \mathbf{x} \in X,\ g_j(\mathbf{x}) > 0 \text{ for each } j \right\}.$$

Example 1 (A Cube). Consider the set $S$ defined by $x_i \ge 0$, $(1 - x_i) \ge 0$, for
$i = 1, 2, \ldots, n$. This is $S = [0, 1]^n$, the unit cube in $E^n$. The analytic center can be
found by differentiation to be $x_i = 1/2$, for all $i$. Hence, the analytic center is
identical to what one would normally call the center of the unit cube.

In general, the analytic center depends on how the set is defined, that is, on the
particular inequalities used in the definition. For instance, the unit cube is also
defined by the inequalities $x_i \ge 0$, $(1 - x_i)^d \ge 0$ with odd $d > 1$. In this case the
solution is $x_i = 1/(d + 1)$ for all $i$. For large $d$ this point is near the inner corner of
the unit cube.

Also,  the  addition  of  redundant  inequalities  can  change  the  location  of  the 
analytic  center.  For  example,  repeating  a  given  inequality  will  change  the  center’s 
location. 
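The dependence on the representation can be seen in one coordinate of the cube. The sketch below is illustrative only: it assumes the inequality $x \ge 0$ is counted $p$ times and $(1 - x) \ge 0$ is counted $q$ times (the case $(1 - x)^d \ge 0$ corresponds to $q = d$), and minimizes the resulting potential $-p \log x - q \log(1 - x)$ by bisection on its derivative. The function name `weighted_center` is hypothetical.

```python
def weighted_center(p, q, tol=1e-12):
    """Minimize -p*log(x) - q*log(1 - x) over 0 < x < 1.

    The derivative -p/x + q/(1 - x) is increasing in x, so its root can be
    bracketed and located by bisection.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        grad = -p / mid + q / (1.0 - mid)
        if grad < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With `p = q = 1` the center is $1/2$; with `q = 3` (the case $d = 3$) it moves to $1/4$; and repeating $x \ge 0$ once, `p = 2`, `q = 1`, moves it to $2/3$.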

There  are  several  sets  associated  with  linear  programs  for  which  the  analytic 
center  is  of  particular  interest.  One  such  set  is  the  feasible  region  itself.  Another  is 
the  set  of  optimal  solutions.  There  are  also  sets  associated  with  dual  and  primal-dual 
formulations.  All  of  these  are  related  in  important  ways. 


Let us illustrate by considering the analytic center associated with a bounded
polytope $\Omega$ in $E^m$ represented by $n\ (> m)$ linear inequalities; that is,

$$\Omega = \{\mathbf{y} \in E^m : \mathbf{c}^T - \mathbf{y}^T \mathbf{A} \ge \mathbf{0}\},$$

where $\mathbf{A} \in E^{m \times n}$ and $\mathbf{c} \in E^n$ are given and $\mathbf{A}$ has rank $m$. Denote the interior of $\Omega$ by

$$\mathring{\Omega} = \{\mathbf{y} \in E^m : \mathbf{c}^T - \mathbf{y}^T \mathbf{A} > \mathbf{0}\}.$$

The potential function for this set is

$$\psi_{\Omega}(\mathbf{y}) \equiv -\sum_{j=1}^{n} \log(c_j - \mathbf{y}^T \mathbf{a}_j) = -\sum_{j=1}^{n} \log s_j, \tag{5.4}$$

where $\mathbf{s} \equiv \mathbf{c} - \mathbf{A}^T \mathbf{y}$ is a slack vector. Hence the potential function is the negative
sum of the logarithms of the slack variables.

The analytic center of $\Omega$ is the interior point of $\Omega$ that minimizes the potential
function. This point is denoted by $\mathbf{y}^a$ and has the associated $\mathbf{s}^a = \mathbf{c} - \mathbf{A}^T \mathbf{y}^a$.
The pair $(\mathbf{y}^a, \mathbf{s}^a)$ is uniquely defined, since the potential function is strictly convex
(see Sect. 7.4) in the bounded convex set $\mathring{\Omega}$.

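Setting to zero the derivatives of $\psi(\mathbf{y})$ with respect to each $y_i$ gives

$$\sum_{j=1}^{n} \frac{a_{ij}}{c_j - \mathbf{y}^T \mathbf{a}_j} = 0, \quad \text{for all } i,$$

which can be written

$$\sum_{j=1}^{n} \frac{a_{ij}}{s_j} = 0, \quad \text{for all } i.$$

Now define $x_j = 1/s_j$ for each $j$. We introduce the notation

$$\mathbf{x} \circ \mathbf{s} \equiv (x_1 s_1, x_2 s_2, \ldots, x_n s_n)^T,$$

which is component multiplication. Then the analytic center is defined by the
conditions

$$
\begin{aligned}
\mathbf{x} \circ \mathbf{s} &= \mathbf{1} \\
\mathbf{A}\mathbf{x} &= \mathbf{0} \\
\mathbf{A}^T \mathbf{y} + \mathbf{s} &= \mathbf{c}.
\end{aligned}
$$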
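These conditions can be verified on a small polytope. The sketch below is one possible illustration, not a prescribed algorithm: it assumes $m = 2$, uses a damped Newton iteration on the potential (5.4) with simple step halving, and returns the slacks $\mathbf{s}$ and the vector $\mathbf{x}$ with $x_j = 1/s_j$. The function name `analytic_center_2d` and the test polytope are hypothetical choices.

```python
import math


def analytic_center_2d(a, c, y, iters=50):
    """Damped Newton's method on psi(y) = -sum_j log(c_j - a_j . y), y in E^2.

    a: list of n constraint vectors (length 2); c: list of n scalars;
    y: strictly interior starting point. Returns (y, s, x) with x_j = 1/s_j.
    """
    def slacks(p):
        return [cj - aj[0] * p[0] - aj[1] * p[1] for aj, cj in zip(a, c)]

    def psi(p):
        s = slacks(p)
        return math.inf if min(s) <= 0 else -sum(math.log(v) for v in s)

    for _ in range(iters):
        s = slacks(y)
        # Gradient and Hessian of the potential at y.
        g = [sum(aj[i] / sj for aj, sj in zip(a, s)) for i in range(2)]
        h = [[sum(aj[i] * aj[k] / sj ** 2 for aj, sj in zip(a, s))
              for k in range(2)] for i in range(2)]
        det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
        # Newton direction d = -H^{-1} g via the explicit 2x2 inverse.
        d = [-(h[1][1] * g[0] - h[0][1] * g[1]) / det,
             -(h[0][0] * g[1] - h[1][0] * g[0]) / det]
        # Halve the step until the point stays interior and psi does not rise.
        t = 1.0
        while psi([y[0] + t * d[0], y[1] + t * d[1]]) > psi(y) and t > 1e-12:
            t *= 0.5
        y = [y[0] + t * d[0], y[1] + t * d[1]]
    s = slacks(y)
    return y, s, [1.0 / v for v in s]
```

For the triangle $y_1 \ge 0$, $y_2 \ge 0$, $y_1 + y_2 \le 1$, written as `a = [(-1, 0), (0, -1), (1, 1)]` and `c = [0, 0, 1]`, symmetry gives the analytic center $(1/3, 1/3)$, with $\mathbf{s} = (1/3, 1/3, 1/3)$ and $\mathbf{x} = (3, 3, 3)$, so that $\mathbf{A}\mathbf{x} = \mathbf{0}$.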

The analytic center can be defined when the interior is empty or equalities are
present, such as

$$\Omega = \{\mathbf{y} \in E^m : \mathbf{c}^T - \mathbf{y}^T \mathbf{A} \ge \mathbf{0},\ \mathbf{B}\mathbf{y} = \mathbf{b}\}.$$

In this case the analytic center is chosen on the linear surface $\{\mathbf{y} : \mathbf{B}\mathbf{y} = \mathbf{b}\}$ to
maximize the product of the slack variables $\mathbf{s} = \mathbf{c} - \mathbf{A}^T \mathbf{y}$. Thus, in this context
the interior of $\Omega$ refers to the interior of the positive orthant of slack variables:
$R^n_+ \equiv \{\mathbf{s} : \mathbf{s} \ge \mathbf{0}\}$. This definition of interior depends only on the region of the slack
variables. Even if there is only a single point in $\Omega$ with $\mathbf{s} = \mathbf{c} - \mathbf{A}^T \mathbf{y}$ for some $\mathbf{y}$
where $\mathbf{B}\mathbf{y} = \mathbf{b}$ with $\mathbf{s} > \mathbf{0}$, we still say that $\mathring{\Omega}$ is not empty.


5.5  The  Central  Path 

The  concept  underlying  interior-point  methods  for  linear  programming  is  to  use 
nonlinear  programming  techniques  of  analysis  and  methodology.  The  analysis  is 
often  based  on  differentiation  of  the  functions  defining  the  problem.  Traditional 
linear  programming  does  not  require  these  techniques  since  the  defining  functions 
are  linear.  Duality  in  general  nonlinear  programs  is  typically  manifested  through 
Lagrange  multipliers  (which  are  called  dual  variables  in  linear  programming).  The 
analysis and algorithms of the remaining sections of the chapter use these nonlinear
techniques. These techniques are discussed systematically in later chapters, so
rather  than  treat  them  in  detail  at  this  point,  these  current  sections  provide  only 
minimal  detail  in  their  application  to  linear  programming.  It  is  expected  that  most 
readers are already familiar with the basic method for minimizing a function by
setting its derivative to zero, and for incorporating constraints by introducing Lagrange
multipliers.  These  methods  are  discussed  in  detail  in  Chaps.  11-15. 

The  computational  algorithms  of  nonlinear  programming  are  typically  iterative 
in  nature,  often  characterized  as  search  algorithms.  At  any  step  with  a  given  point, 
a  direction  for  search  is  established  and  then  a  move  in  that  direction  is  made  to 
define  the  next  point.  There  are  many  varieties  of  such  search  algorithms  and  they 
are  systematically  presented  throughout  the  text.  In  this  chapter,  we  use  versions  of 
Newton’s  method  as  the  search  algorithm,  but  we  postpone  a  detailed  study  of  the 
method  until  later  chapters. 

Not  only  have  nonlinear  methods  improved  linear  programming,  but  interior- 
point methods for linear programming have been extended to provide new
approaches to nonlinear programming. This chapter is intended to show how this
merger  of  linear  and  nonlinear  programming  produces  elegant  and  effective  methods. 
These  ideas  take  an  especially  pleasing  form  when  applied  to  linear  programming. 
Study  of  them  here,  even  without  all  the  detailed  analysis,  should  provide  good 
intuitive  background  for  the  more  general  manifestations. 

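Consider a primal linear program in standard form

$$
\begin{aligned}
\text{(LP)}\quad \text{minimize}\quad & \mathbf{c}^T \mathbf{x} \\
\text{subject to}\quad & \mathbf{A}\mathbf{x} = \mathbf{b},\ \mathbf{x} \ge \mathbf{0}.
\end{aligned}
\tag{5.5}
$$

We denote the feasible region of this program by $\mathcal{F}_p$. We assume that
$\mathring{\mathcal{F}}_p = \{\mathbf{x} : \mathbf{A}\mathbf{x} = \mathbf{b},\ \mathbf{x} > \mathbf{0}\}$ is nonempty and the optimal solution set of the problem
is bounded.

Associated with this problem, we define for $\mu \ge 0$ the barrier problem

$$
\begin{aligned}
\text{(BP)}\quad \text{minimize}\quad & \mathbf{c}^T \mathbf{x} - \mu \sum_{j=1}^{n} \log x_j \\
\text{subject to}\quad & \mathbf{A}\mathbf{x} = \mathbf{b},\ \mathbf{x} > \mathbf{0}.
\end{aligned}
\tag{5.6}
$$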


It is clear that $\mu = 0$ corresponds to the original problem (5.5). As $\mu \to \infty$, the
solution approaches the analytic center of the feasible region (when it is bounded),
since the barrier term swamps out $\mathbf{c}^T \mathbf{x}$ in the objective. As $\mu$ is varied continuously
toward 0, there is a path $\mathbf{x}(\mu)$ defined by the solution to (BP). This path $\mathbf{x}(\mu)$ is
termed the primal central path. As $\mu \to 0$ this path converges to the analytic center
of the optimal face $\{\mathbf{x} : \mathbf{c}^T \mathbf{x} = z^*,\ \mathbf{A}\mathbf{x} = \mathbf{b},\ \mathbf{x} \ge \mathbf{0}\}$, where $z^*$ is the optimal value of
(LP).

A strategy for solving (LP) is to solve (BP) for smaller and smaller values of $\mu$
and thereby approach a solution to (LP). This is indeed the basic idea of interior-
point methods.

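At any $\mu > 0$, under the assumptions that we have made for problem (5.5), the
necessary and sufficient conditions for a unique and bounded solution are obtained
by introducing a Lagrange multiplier vector $\mathbf{y}$ for the linear equality constraints to
form the Lagrangian (see Chap. 11)

$$\mathbf{c}^T \mathbf{x} - \mu \sum_{j=1}^{n} \log x_j - \mathbf{y}^T (\mathbf{A}\mathbf{x} - \mathbf{b}).$$

The derivatives with respect to the $x_j$'s are set to zero, leading to the conditions

$$c_j - \mu / x_j - \mathbf{y}^T \mathbf{a}_j = 0, \quad \text{for each } j$$

or equivalently

$$\mu \mathbf{X}^{-1} \mathbf{1} + \mathbf{A}^T \mathbf{y} = \mathbf{c} \tag{5.7}$$

where as before $\mathbf{a}_j$ is the $j$th column of $\mathbf{A}$, $\mathbf{1}$ is the vector of 1's, and $\mathbf{X}$ is the diagonal
matrix whose diagonal entries are the components of $\mathbf{x} > \mathbf{0}$. Setting $s_j = \mu / x_j$ the
complete set of conditions can be rewritten

$$
\begin{aligned}
\mathbf{x} \circ \mathbf{s} &= \mu \mathbf{1} \\
\mathbf{A}\mathbf{x} &= \mathbf{b} \\
\mathbf{A}^T \mathbf{y} + \mathbf{s} &= \mathbf{c}.
\end{aligned}
\tag{5.8}
$$

Note that $\mathbf{y}$ is a dual feasible solution and $\mathbf{c} - \mathbf{A}^T \mathbf{y} > \mathbf{0}$ (see Exercise 4).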

Example 2 (A Square Primal). Consider the problem of maximizing $x_1$ within the
unit square $S = [0, 1]^2$. The problem is formulated as

$$
\begin{aligned}
\min\quad & -x_1 \\
\text{s.t.}\quad & x_1 + x_3 = 1 \\
& x_2 + x_4 = 1 \\
& x_1 \ge 0,\ x_2 \ge 0,\ x_3 \ge 0,\ x_4 \ge 0.
\end{aligned}
$$

Here $x_3$ and $x_4$ are slack variables for the original problem to put it in standard
form. The optimality conditions for $\mathbf{x}(\mu)$ consist of the original two linear constraint
equations and the four equations

$$y_1 + s_1 = -1, \quad y_2 + s_2 = 0, \quad y_1 + s_3 = 0, \quad y_2 + s_4 = 0$$

together with the relations $s_i = \mu / x_i$ for $i = 1, 2, \ldots, 4$. These equations are readily
solved with a series of elementary variable eliminations to find

$$x_1(\mu) = \frac{1 - 2\mu \pm \sqrt{1 + 4\mu^2}}{2}, \qquad x_2(\mu) = 1/2.$$

Using the "+" solution, it is seen that as $\mu \to 0$ the solution goes to $\mathbf{x} \to (1, 1/2)$.
Note that this solution is not a corner of the cube. Instead it is at the analytic center
of the optimal face $\{\mathbf{x} : x_1 = 1,\ 0 \le x_2 \le 1\}$. See Fig. 5.2. The limit of $\mathbf{x}(\mu)$ as
$\mu \to \infty$ can be seen to be the point $(1/2, 1/2)$. Hence, the central path in this case is
a straight line progressing from the analytic center of the square (at $\mu \to \infty$) to the
analytic center of the optimal face (at $\mu \to 0$).
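The closed form for $x_1(\mu)$ can be checked directly. Eliminating $y_1$, $s_1$ and $s_3$ from the conditions of Example 2 gives the single equation $-1 - \mu/x_1 + \mu/(1 - x_1) = 0$; the sketch below evaluates the "+" root and this residual. The function names are hypothetical.

```python
import math


def x1_on_path(mu):
    """'+' root of x1^2 + (2*mu - 1)*x1 - mu = 0, i.e. x1(mu) in Example 2."""
    return (1.0 - 2.0 * mu + math.sqrt(1.0 + 4.0 * mu * mu)) / 2.0


def stationarity_residual(mu):
    """Residual of -1 - mu/x1 + mu/(1 - x1) = 0 at x1 = x1(mu)."""
    x1 = x1_on_path(mu)
    return -1.0 - mu / x1 + mu / (1.0 - x1)
```

Evaluating `x1_on_path` for very small and very large `mu` reproduces the two limits quoted in the example, $x_1 \to 1$ and $x_1 \to 1/2$.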


Dual  Central  Path 

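Now consider the dual problem

$$
\begin{aligned}
\text{(LD)}\quad \text{maximize}\quad & \mathbf{y}^T \mathbf{b} \\
\text{subject to}\quad & \mathbf{y}^T \mathbf{A} + \mathbf{s}^T = \mathbf{c}^T,\ \mathbf{s} \ge \mathbf{0}.
\end{aligned}
$$

We may apply the barrier approach to this problem by formulating the problem

$$
\begin{aligned}
\text{(BD)}\quad \text{maximize}\quad & \mathbf{y}^T \mathbf{b} + \mu \sum_{j=1}^{n} \log s_j \\
\text{subject to}\quad & \mathbf{y}^T \mathbf{A} + \mathbf{s}^T = \mathbf{c}^T,\ \mathbf{s} > \mathbf{0}.
\end{aligned}
$$

We assume that the interior $\mathring{\mathcal{F}}_d = \{(\mathbf{y}, \mathbf{s}) : \mathbf{y}^T \mathbf{A} + \mathbf{s}^T = \mathbf{c}^T,\ \mathbf{s} > \mathbf{0}\}$ of the dual
feasible set $\mathcal{F}_d$ is nonempty and the optimal solution set of (LD) is bounded. Then, as $\mu$
is varied continuously toward 0, there is a path $(\mathbf{y}(\mu), \mathbf{s}(\mu))$ defined by the solution
to (BD). This path is termed the dual central path.

Fig. 5.2 The analytic path for the square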


To work out the necessary and sufficient conditions we introduce $\mathbf{x}$ as a Lagrange
multiplier and form the Lagrangian

$$\mathbf{y}^T \mathbf{b} + \mu \sum_{j=1}^{n} \log s_j - (\mathbf{y}^T \mathbf{A} + \mathbf{s}^T - \mathbf{c}^T)\mathbf{x}.$$

Setting to zero the derivative with respect to $y_i$ leads to

$$b_i - \mathbf{a}^i \mathbf{x} = 0, \quad \text{for all } i$$

where $\mathbf{a}^i$ is the $i$th row of $\mathbf{A}$. Setting to zero the derivative with respect to $s_j$ leads to

$$\mu / s_j - x_j = 0, \quad \text{or} \quad \mu - x_j s_j = 0, \quad \text{for all } j.$$

Combining these equations and including the original constraint yields the complete
set of conditions which are identical to the optimality conditions for the primal
central path (5.8). Note that $\mathbf{x}$ is indeed a primal feasible solution and $\mathbf{x} > \mathbf{0}$.

To see the geometric representation of the dual central path, consider the dual
level set

$$\Omega(z) = \{\mathbf{y} : \mathbf{c}^T - \mathbf{y}^T \mathbf{A} \ge \mathbf{0},\ \mathbf{y}^T \mathbf{b} \ge z\}$$

for any $z < z^*$ where $z^*$ is the optimal value of (LD). Then, the analytic center
$(\mathbf{y}(z), \mathbf{s}(z))$ of $\Omega(z)$ coincides with the dual central path as $z$ tends to the optimal
value $z^*$ from below. This is illustrated in Fig. 5.3, where the feasible region of the
dual set (not the primal) is shown. The level sets $\Omega(z)$ are shown for various values
of $z$. The analytic centers of these level sets correspond to the dual central path.

Example 3 (The Square Dual). Consider the dual of Example 2. This is

$$
\begin{aligned}
\max\quad & y_1 + y_2 \\
\text{subject to}\quad & y_1 \le -1 \\
& y_2 \le 0.
\end{aligned}
$$

(The values of $s_1$ and $s_2$ are the slack variables of the inequalities.) The solution
to the dual barrier problem is easily found from the solution of the primal barrier
problem to be

$$y_1(\mu) = -1 - \mu / x_1(\mu), \qquad y_2(\mu) = -2\mu.$$


Fig.  5.3  The  central 


As $\mu \to 0$, we have $y_1 \to -1$, $y_2 \to 0$, which is the unique solution to the dual LP.
However, as $\mu \to \infty$, the vector $\mathbf{y}$ is unbounded, for in this case the dual feasible set
is itself unbounded.


Primal-Dual  Central  Path 

Suppose the feasible region of the primal (LP) has interior points and its optimal
solution set is bounded. Then, the dual also has interior points (see Exercise 4). The
primal-dual path is defined to be the set of vectors $(\mathbf{x}(\mu) > \mathbf{0},\ \mathbf{y}(\mu),\ \mathbf{s}(\mu) > \mathbf{0})$ that
satisfy the conditions

$$
\begin{aligned}
\mathbf{x} \circ \mathbf{s} &= \mu \mathbf{1} \\
\mathbf{A}\mathbf{x} &= \mathbf{b} \\
\mathbf{A}^T \mathbf{y} + \mathbf{s} &= \mathbf{c}
\end{aligned}
$$

for $0 < \mu < \infty$. Hence the central path is defined without explicit reference to
an optimization problem. It is simply defined in terms of the set of equality and
inequality conditions.

Since  conditions  (5.8)  and  (5.9)  are  identical,  the  primal-dual  central  path  can  be 
split  into  two  components  by  projecting  onto  the  relevant  space,  as  described  in  the 
following  proposition. 

Proposition 1. Suppose the feasible sets of the primal and dual programs contain interior
points. Then the primal-dual central path $(\mathbf{x}(\mu), \mathbf{y}(\mu), \mathbf{s}(\mu))$ exists for all $\mu$, $0 < \mu < \infty$.
Furthermore, $\mathbf{x}(\mu)$ is the primal central path, and $(\mathbf{y}(\mu), \mathbf{s}(\mu))$ is the dual central path.
Moreover, $\mathbf{x}(\mu)$ and $(\mathbf{y}(\mu), \mathbf{s}(\mu))$ converge to the analytic centers of the optimal primal
solution and dual solution faces, respectively, as $\mu \to 0$.


Duality  Gap 

Let $(\mathbf{x}(\mu), \mathbf{y}(\mu), \mathbf{s}(\mu))$ be on the primal-dual central path. Then from (5.9) it follows
that

$$\mathbf{c}^T \mathbf{x} - \mathbf{y}^T \mathbf{b} = \mathbf{y}^T \mathbf{A}\mathbf{x} + \mathbf{s}^T \mathbf{x} - \mathbf{y}^T \mathbf{b} = \mathbf{s}^T \mathbf{x} = n\mu.$$

The value $\mathbf{c}^T \mathbf{x} - \mathbf{y}^T \mathbf{b} = \mathbf{s}^T \mathbf{x}$ is the difference between the primal objective value
and the dual objective value. This value is always nonnegative (see the weak duality
lemma in Sect. 4.2) and is termed the duality gap. At any point on the primal-dual
central path, the duality gap is equal to $n\mu$. It is clear that as $\mu \to 0$ the duality
gap goes to zero, and hence both $\mathbf{x}(\mu)$ and $(\mathbf{y}(\mu), \mathbf{s}(\mu))$ approach optimality for the
primal and dual, respectively.

The duality gap provides a measure of closeness to optimality. For any primal
feasible $\mathbf{x}$, the value $\mathbf{c}^T \mathbf{x}$ gives an upper bound as $\mathbf{c}^T \mathbf{x} \ge z^*$ where $z^*$ is the optimal
value of the primal. Likewise, for any dual feasible pair $(\mathbf{y}, \mathbf{s})$, the value $\mathbf{y}^T \mathbf{b}$ gives
a lower bound as $\mathbf{y}^T \mathbf{b} \le z^*$. The difference, the duality gap $g = \mathbf{c}^T \mathbf{x} - \mathbf{y}^T \mathbf{b}$, provides
a bound on $z^*$ as $z^* \ge \mathbf{c}^T \mathbf{x} - g$. Hence if at a feasible point $\mathbf{x}$, a dual feasible $(\mathbf{y}, \mathbf{s})$
is available, the quality of $\mathbf{x}$ can be measured as $\mathbf{c}^T \mathbf{x} - z^* \le g$.
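The relation between the duality gap and $\mu$ can be checked on the square problem of Examples 2 and 3, where $n = 4$, $\mathbf{c} = (-1, 0, 0, 0)$ and $\mathbf{b} = (1, 1)$. The sketch below simply evaluates the closed-form path given in those examples; the function names are hypothetical.

```python
import math


def central_path_point(mu):
    """Primal-dual central path point of Examples 2 and 3 for a given mu > 0."""
    x1 = (1.0 - 2.0 * mu + math.sqrt(1.0 + 4.0 * mu * mu)) / 2.0
    x = [x1, 0.5, 1.0 - x1, 0.5]
    s = [mu / xj for xj in x]            # enforces x o s = mu * 1
    y = [-1.0 - mu / x1, -2.0 * mu]      # dual path from Example 3
    return x, y, s


def duality_gap(mu):
    """c^T x - y^T b on the central path, with c = (-1, 0, 0, 0), b = (1, 1)."""
    x, y, _ = central_path_point(mu)
    primal = -x[0]
    dual = y[0] + y[1]
    return primal - dual
```

For any `mu > 0` the returned gap agrees with $n\mu = 4\mu$ up to rounding.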


5.6  Solution  Strategies 

The various definitions of the central path directly suggest corresponding
strategies for solution of a linear program. We outline three general approaches here:
the primal barrier or path-following method, the primal-dual path-following method
and the primal-dual potential-reduction method, although the details of their
implementation and analysis must be deferred to later chapters after study of general
nonlinear methods. Table 5.1 depicts these solution strategies and the simplex
methods described in Chaps. 3 and 4 with respect to how they meet the three optimality
conditions: Primal Feasibility, Dual Feasibility, and Zero-Duality during the
iterative process.


Table 5.1 Properties of algorithms

                                   P-F    D-F    0-Duality
Primal simplex                     √             √
Dual simplex                              √      √
Primal barrier                     √
Primal-dual path-following         √      √
Primal-dual potential-reduction    √      √

For example, the primal simplex method keeps improving a primal feasible
solution, maintains the zero-duality gap (complementary slackness condition) and
moves toward dual feasibility; while the dual simplex method keeps improving a
dual feasible solution, maintains the zero-duality gap (complementarity condition)
and moves toward primal feasibility (see Sect. 4.3). The primal barrier method keeps
improving a primal feasible solution and moves toward dual feasibility and
complementarity; and the primal-dual interior-point methods keep improving a primal and
dual feasible solution pair and move toward complementarity.


Primal  Barrier  Method 


A direct approach is to use the barrier construction and solve the problem

$$
\begin{aligned}
\text{minimize}\quad & \mathbf{c}^T \mathbf{x} - \mu \sum_{j=1}^{n} \log x_j \\
\text{subject to}\quad & \mathbf{A}\mathbf{x} = \mathbf{b},\ \mathbf{x} > \mathbf{0},
\end{aligned}
\tag{5.9}
$$

for a very small value of $\mu$. In fact, if we desire to reduce the duality gap to $\varepsilon$ it is
only necessary to solve the problem for $\mu = \varepsilon / n$. Unfortunately, when $\mu$ is small,
the problem (5.9) could be highly ill-conditioned in the sense that the necessary
conditions are nearly singular. This makes it difficult to directly solve the problem
for small $\mu$.

An overall strategy, therefore, is to start with a moderately large $\mu$ (say $\mu = 100$)
and solve that problem approximately. The corresponding solution is a point
approximately on the primal central path, but it is likely to be quite distant from the
point corresponding to the limit of $\mu \to 0$. However this solution point at $\mu = 100$
can be used as the starting point for the problem with a slightly smaller $\mu$, for this
point is likely to be close to the solution of the new problem. The value of $\mu$ might
be reduced at each stage by a specific factor, giving $\mu_{k+1} = \gamma \mu_k$, where $\gamma$ is a fixed
positive parameter less than one and $k$ is the stage count.

If the strategy is begun with a value $\mu_0$, then at the $k$th stage we have $\mu_k = \gamma^k \mu_0$.
Hence to reduce $\mu_k / \mu_0$ to below $\varepsilon$, requires

$$k = \frac{\log \varepsilon}{\log \gamma}$$

stages.
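As a small worked illustration of this stage count, the sketch below rounds $\log \varepsilon / \log \gamma$ up to an integer. The function name `stages_needed` is hypothetical, and because the quotient is computed in floating point, the result may be off by one when $\varepsilon$ is an exact power of $\gamma$.

```python
import math


def stages_needed(eps, gamma):
    """Number of stages k with gamma**k <= eps, for 0 < eps < 1, 0 < gamma < 1.

    Computed as ceil(log(eps) / log(gamma)).
    """
    return math.ceil(math.log(eps) / math.log(gamma))
```

With $\gamma = 0.5$, reducing $\mu_k / \mu_0$ below $10^{-3}$ takes 10 stages, since $0.5^{10} \approx 9.8 \times 10^{-4}$ while $0.5^{9} \approx 2.0 \times 10^{-3}$.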

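Often a version of Newton's method for minimization is used to solve each of the
problems. For the current strategy, Newton's method works on problem (5.9) with
fixed $\mu$ by considering the central path equations (5.8)

$$
\begin{aligned}
\mathbf{x} \circ \mathbf{s} &= \mu \mathbf{1} \\
\mathbf{A}\mathbf{x} &= \mathbf{b} \\
\mathbf{A}^T \mathbf{y} + \mathbf{s} &= \mathbf{c}.
\end{aligned}
\tag{5.10}
$$

From a given point $\mathbf{x} \in \mathring{\mathcal{F}}_p$, Newton's method moves to a closer point $\mathbf{x}^+ \in \mathring{\mathcal{F}}_p$
by moving in the directions $\mathbf{d}_x$, $\mathbf{d}_y$ and $\mathbf{d}_s$ determined from the linearized version
of (5.10)

$$
\begin{aligned}
\mu \mathbf{X}^{-2} \mathbf{d}_x + \mathbf{d}_s &= \mu \mathbf{X}^{-1} \mathbf{1} - \mathbf{c}, \\
\mathbf{A} \mathbf{d}_x &= \mathbf{0}, \\
-\mathbf{A}^T \mathbf{d}_y - \mathbf{d}_s &= \mathbf{0}.
\end{aligned}
\tag{5.11}
$$

(Recall that $\mathbf{X}$ is the diagonal matrix whose diagonal entries are components of
$\mathbf{x} > \mathbf{0}$.) The new point is then updated by taking a step in the direction of $\mathbf{d}_x$, as
$\mathbf{x}^+ = \mathbf{x} + \mathbf{d}_x$.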

Notice that if $\mathbf{x} \circ \mathbf{s} = \mu \mathbf{1}$ for some $\mathbf{s} = \mathbf{c} - \mathbf{A}^T \mathbf{y}$, then $\mathbf{d} \equiv (\mathbf{d}_x, \mathbf{d}_y, \mathbf{d}_s) = \mathbf{0}$ because
the current point satisfies $\mathbf{A}\mathbf{x} = \mathbf{b}$ and hence is already the central path solution
for $\mu$. If some component of $\mathbf{x} \circ \mathbf{s}$ is less than $\mu$, then $\mathbf{d}$ will tend to increment the
solution so as to increase that component. The converse will occur for components
of $\mathbf{x} \circ \mathbf{s}$ greater than $\mu$.

This process may be repeated several times until a point close enough to the
proper solution to the barrier problem for the given value of $\mu$ is obtained. That is,
until the necessary and sufficient conditions (5.7) are (approximately) satisfied.

There are several details involved in a complete implementation and analysis of
Newton's method. These items are discussed in later chapters of the text. However,
the method works well if either $\mu$ is moderately large, or if the algorithm is
initiated at a point very close to the solution, exactly as needed for the barrier strategy
discussed in this subsection.

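To solve (5.11), premultiplying both sides by $\mathbf{X}^2$ we have

$$\mu \mathbf{d}_x + \mathbf{X}^2 \mathbf{d}_s = \mu \mathbf{X} \mathbf{1} - \mathbf{X}^2 \mathbf{c}.$$

Then, premultiplying by $\mathbf{A}$ and using $\mathbf{A} \mathbf{d}_x = \mathbf{0}$, we have

$$\mathbf{A} \mathbf{X}^2 \mathbf{d}_s = \mu \mathbf{A} \mathbf{X} \mathbf{1} - \mathbf{A} \mathbf{X}^2 \mathbf{c}.$$

Using $\mathbf{d}_s = -\mathbf{A}^T \mathbf{d}_y$ we have

$$(\mathbf{A} \mathbf{X}^2 \mathbf{A}^T) \mathbf{d}_y = -\mu \mathbf{A} \mathbf{X} \mathbf{1} + \mathbf{A} \mathbf{X}^2 \mathbf{c}.$$

Thus, $\mathbf{d}_y$ can be computed by solving the above linear system of equations. Then $\mathbf{d}_s$
can be found from the third equation in (5.11) and finally $\mathbf{d}_x$ can be found from the
first equation in (5.11). Together this amounts to $O(nm^2 + m^3)$ arithmetic operations
for each Newton step.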
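The elimination just described can be sketched as follows. This is a minimal dense illustration under stated assumptions, not a production solver: it uses plain Python lists, a small Gaussian elimination routine for the $m \times m$ system, and full Newton steps with no step-size control. The function names `solve` and `barrier_newton_step` are hypothetical.

```python
def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[p] = aug[p], aug[k]
        for i in range(k + 1, n):
            f = aug[i][k] / aug[k][k]
            aug[i] = [u - f * v for u, v in zip(aug[i], aug[k])]
    z = [0.0] * n
    for k in reversed(range(n)):
        tail = sum(aug[k][j] * z[j] for j in range(k + 1, n))
        z[k] = (aug[k][n] - tail) / aug[k][k]
    return z


def barrier_newton_step(A, c, x, mu):
    """One Newton direction (dx, dy, ds) from system (5.11) at an interior x."""
    m, n = len(A), len(x)
    # (A X^2 A^T) dy = -mu * A X 1 + A X^2 c
    M = [[sum(A[i][j] * x[j] ** 2 * A[k][j] for j in range(n))
          for k in range(m)] for i in range(m)]
    rhs = [sum(A[i][j] * (x[j] ** 2 * c[j] - mu * x[j]) for j in range(n))
           for i in range(m)]
    dy = solve(M, rhs)
    # Third equation of (5.11): ds = -A^T dy.
    ds = [-sum(A[i][j] * dy[i] for i in range(m)) for j in range(n)]
    # First equation times X^2: mu * dx + X^2 ds = mu * X 1 - X^2 c.
    dx = [x[j] - x[j] ** 2 * (c[j] + ds[j]) / mu for j in range(n)]
    return dx, dy, ds
```

Applied to Example 2 with $\mu = 1$, starting from $\mathbf{x} = (1/2, 1/2, 1/2, 1/2)$ and repeating $\mathbf{x}^+ = \mathbf{x} + \mathbf{d}_x$, the iterates approach $x_1(1) = (\sqrt{5} - 1)/2 \approx 0.618$ while $x_2$ stays at $1/2$.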
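Primal-Dual Path-Following

Another strategy for solving a linear program is to follow the central path from a
given initial primal-dual solution pair. Consider a linear program in standard form

$$
\begin{array}{llll}
\text{Primal} & & \text{Dual} & \\
\text{minimize} & \mathbf{c}^T \mathbf{x} & \text{maximize} & \mathbf{y}^T \mathbf{b} \\
\text{subject to} & \mathbf{A}\mathbf{x} = \mathbf{b},\ \mathbf{x} \ge \mathbf{0} & \text{subject to} & \mathbf{y}^T \mathbf{A} \le \mathbf{c}^T.
\end{array}
$$

Assume that the interior of both primal and dual feasible regions $\mathring{\mathcal{F}} \ne \emptyset$; that is,
both$^4$

$$\mathring{\mathcal{F}}_p = \{\mathbf{x} : \mathbf{A}\mathbf{x} = \mathbf{b},\ \mathbf{x} > \mathbf{0}\} \ne \emptyset \quad \text{and} \quad \mathring{\mathcal{F}}_d = \{(\mathbf{y}, \mathbf{s}) : \mathbf{s} = \mathbf{c} - \mathbf{A}^T \mathbf{y} > \mathbf{0}\} \ne \emptyset;$$

and denote by $z^*$ the optimal objective value.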

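The central path can be expressed as

$$\mathcal{C} = \left\{ (\mathbf{x}, \mathbf{y}, \mathbf{s}) \in \mathring{\mathcal{F}} : \mathbf{x} \circ \mathbf{s} = \frac{\mathbf{x}^T \mathbf{s}}{n} \mathbf{1} \right\}$$

in the primal-dual form. On the path we have $\mathbf{x} \circ \mathbf{s} = \mu \mathbf{1}$ and hence $\mathbf{s}^T \mathbf{x} = n\mu$.
A neighborhood of the central path $\mathcal{C}$ is of the form

$$\mathcal{N}(\eta) = \{ (\mathbf{x}, \mathbf{y}, \mathbf{s}) \in \mathring{\mathcal{F}} : |\mathbf{s} \circ \mathbf{x} - \mu \mathbf{1}| \le \eta \mu, \text{ where } \mu = \mathbf{s}^T \mathbf{x} / n \} \tag{5.12}$$

for some $\eta \in (0, 1)$, say $\eta = 1/4$. This can be thought of as a tube whose center is
the central path.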

The idea of the path-following method is to move within a tubular neighborhood
of the central path toward the solution point. A suitable initial point
$(\mathbf{x}^0, \mathbf{y}^0, \mathbf{s}^0) \in \mathcal{N}(\eta)$ can be found by solving the barrier problem for some fixed $\mu_0$ or
from an initialization phase proposed later. After that, step by step moves are made,
alternating between a predictor step and a corrector step. After each pair of steps, the
point achieved is again in the fixed given neighborhood of the central path, but closer to
the linear program's solution set.

The predictor step is designed to move essentially parallel to the true central
path. The step $\mathbf{d} \equiv (\mathbf{d}_x, \mathbf{d}_y, \mathbf{d}_s)$ is determined from the linearized version of the
primal-dual central path equations of (5.9), as

$$
\begin{aligned}
\mathbf{s} \circ \mathbf{d}_x + \mathbf{x} \circ \mathbf{d}_s &= \gamma \mu \mathbf{1} - \mathbf{x} \circ \mathbf{s}, \\
\mathbf{A} \mathbf{d}_x &= \mathbf{0}, \\
-\mathbf{A}^T \mathbf{d}_y - \mathbf{d}_s &= \mathbf{0},
\end{aligned}
\tag{5.13}
$$

where here one selects $\gamma = 0$. (To show the dependence of $\mathbf{d}$ on the current pair
$(\mathbf{x}, \mathbf{s})$ and the parameter $\gamma$, we write $\mathbf{d} = \mathbf{d}(\mathbf{x}, \mathbf{s}, \gamma)$.)


⁴ The symbol $\emptyset$ denotes the empty set.

The new point is then found by taking a step in the direction of $d$, as $(x^+, y^+, s^+) = (x, y, s) + \alpha(d_x, d_y, d_s)$, where $\alpha$ is the step-size. Note that $d_x^T d_s = -d_x^T A^T d_y = 0$ here. Then

$$(x^+)^T s^+ = (x + \alpha d_x)^T (s + \alpha d_s) = x^T s + \alpha(d_x^T s + x^T d_s) = (1 - \alpha)\, x^T s,$$

where the last step follows by multiplying the first equation in (5.13) by $\mathbf{1}^T$. Thus, the predictor step reduces the duality gap by a factor $1 - \alpha$. The step-size $\alpha$ is taken as large as possible in that parallel direction without going outside of the neighborhood $\mathcal{N}(2\eta)$.

The corrector step essentially moves perpendicular to the central path in order to get closer to it. This step moves the solution back to within the neighborhood $\mathcal{N}(\eta)$, and the step is determined by selecting $\gamma = 1$ in (5.13) with $\mu = x^T s/n$. Notice that if $x \circ s = \mu\mathbf{1}$, then $d = 0$ because the current point is already a central path solution.

This corrector step is identical to one step of the barrier method. Note, however, that the predictor-corrector method requires only one sequence of steps, each consisting of a single predictor and corrector. This contrasts with the barrier method, which requires a complete sequence for each $\mu$ to get back to the central path, and then an outer sequence to reduce the $\mu$'s.

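One can prove that for any $(x, y, s) \in \mathcal{N}(\eta)$ with $\mu = x^T s/n$, the step-size in the predictor step is bounded below by a quantity of order $1/\sqrt{n}$; that is, $\alpha \ge c/\sqrt{n}$ for some constant $c > 0$ independent of $n$.

Thus, the iteration complexity of the method is $O(\sqrt{n}\,\log(1/\varepsilon))$ to achieve $\mu/\mu_0 \le \varepsilon$, where $n\mu_0$ is the initial duality gap. Moreover, one can prove that the step-size $\alpha \to 1$ as $x^T s \to 0$, that is, the duality gap reduction speed is accelerated as the gap becomes smaller.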
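The alternation of predictor and corrector steps can be sketched as follows. This is an illustrative sketch only, assuming NumPy: the direction is obtained by eliminating $d_x$ and $d_s$ from (5.13), the maximal predictor step is approximated by simple halving until the trial point lies in $\mathcal{N}(2\eta)$, and the test problem is a hypothetical one built so that $x = \mathbf{1}$, $y = 0$, $s = \mathbf{1}$ is on the central path. The names `newton_direction`, `in_neighborhood`, and `predictor_corrector` are not from the text.

```python
import numpy as np

def newton_direction(A, x, s, gamma):
    """Solve (5.13) by eliminating d_x and d_s; b = A x is used for feasible x."""
    mu = x @ s / x.size
    d = x / s                                   # diagonal of S^{-1} X
    M = (A * d) @ A.T                           # normal matrix A S^{-1} X A^T
    dy = np.linalg.solve(M, A @ x - gamma * mu * (A @ (1.0 / s)))
    ds = -A.T @ dy
    dx = gamma * mu / s - x - d * ds
    return dx, dy, ds

def in_neighborhood(x, s, eta):
    if np.any(x <= 0) or np.any(s <= 0):
        return False
    mu = x @ s / x.size
    return bool(np.linalg.norm(x * s - mu) <= eta * mu)

def predictor_corrector(A, x, y, s, eta=0.25, tol=1e-6, max_iter=500):
    gaps = [float(x @ s)]
    for _ in range(max_iter):
        if gaps[-1] <= tol:
            break
        # Predictor (gamma = 0): halve alpha until the trial point is in N(2*eta).
        dx, dy, ds = newton_direction(A, x, s, 0.0)
        alpha = 1.0
        while alpha > 1e-12 and not in_neighborhood(
                x + alpha * dx, s + alpha * ds, 2 * eta):
            alpha *= 0.5
        x, y, s = x + alpha * dx, y + alpha * dy, s + alpha * ds
        # Corrector (gamma = 1): one full Newton step back toward the path.
        dx, dy, ds = newton_direction(A, x, s, 1.0)
        x, y, s = x + dx, y + dy, s + ds
        gaps.append(float(x @ s))
    return x, y, s, gaps

# Hypothetical test problem with a known central starting point.
rng = np.random.default_rng(0)
m, n = 3, 6
A = rng.standard_normal((m, n))
x0, y0, s0 = np.ones(n), np.zeros(m), np.ones(n)
b, c = A @ x0, A.T @ y0 + s0
x, y, s, gaps = predictor_corrector(A, x0, y0, s0)
```

Halving gives a step that may be up to a factor of two shorter than the maximal one, which affects constants but not the qualitative behavior. The corrector leaves the duality gap unchanged, so all gap reduction comes from the predictor steps.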


Primal-Dual  Potential  Reduction  Algorithm 

In this method a primal-dual potential function is used to measure the solution's progress. The potential is reduced at each iteration. There is no restriction on either neighborhood or step-size during the iterative process as long as the potential is reduced. The greater the reduction of the potential function, the faster the convergence of the algorithm. Thus, from a practical point of view, potential-reduction algorithms may have an advantage over path-following algorithms where iterates are confined to lie in certain neighborhoods of the central path.

For $x \in \mathring{\mathcal{F}}_p$ and $(y, s) \in \mathring{\mathcal{F}}_d$ the primal-dual potential function is defined by

$$\psi_{n+\rho}(x, s) = (n + \rho)\log(x^T s) - \sum_{j=1}^{n} \log(x_j s_j), \qquad (5.14)$$

where $\rho \ge 0$.

From the arithmetic and geometric mean inequality (also see Exercise 10) we can derive that

$$n\log(x^T s) - \sum_{j=1}^{n} \log(x_j s_j) \ge n\log n.$$

Then

$$\psi_{n+\rho}(x, s) = \rho\log(x^T s) + n\log(x^T s) - \sum_{j=1}^{n}\log(x_j s_j) \ge \rho\log(x^T s) + n\log n. \qquad (5.15)$$

Thus, for $\rho > 0$, $\psi_{n+\rho}(x, s) \to -\infty$ implies that $x^T s \to 0$. More precisely, we have from (5.15)

$$x^T s \le \exp\left(\frac{\psi_{n+\rho}(x, s) - n\log n}{\rho}\right).$$

Hence the primal-dual potential function gives an explicit bound on the magnitude of the duality gap.
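The bound can be checked numerically. The sketch below, assuming NumPy, evaluates (5.14) and the right-hand side of the gap bound for an arbitrary positive pair, and for a pair with $x \circ s = \mu\mathbf{1}$, where the arithmetic-geometric mean inequality holds with equality and the bound is therefore tight. The function names `potential` and `gap_bound` are illustrative.

```python
import numpy as np

def potential(x, s, rho):
    """Primal-dual potential (5.14): (n + rho) log(x^T s) - sum_j log(x_j s_j)."""
    n = x.size
    return (n + rho) * np.log(x @ s) - np.sum(np.log(x * s))

def gap_bound(x, s, rho):
    """Right-hand side of the gap bound: exp((psi - n log n) / rho)."""
    n = x.size
    return np.exp((potential(x, s, rho) - n * np.log(n)) / rho)

rng = np.random.default_rng(1)
n = 5
rho = np.sqrt(n)
x = rng.uniform(0.1, 3.0, n)
s = rng.uniform(0.1, 3.0, n)
gap, bound = x @ s, gap_bound(x, s, rho)      # gap <= bound for any x, s > 0

# A centered pair: every product x_j s_j equals 2, so the bound is attained.
x_c = np.array([1.0, 2.0, 4.0, 0.5, 8.0])
s_c = 2.0 / x_c
tight = gap_bound(x_c, s_c, rho)              # equals x_c^T s_c up to rounding
```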

The objective of this method is to drive the potential function down toward minus infinity. The method of reduction is a version of Newton's method (5.13). In this case we select $\gamma = n/(n + \rho)$ in (5.13). Notice that this is a combination of a predictor and corrector choice. The predictor uses $\gamma = 0$ and the corrector uses $\gamma = 1$. The primal-dual potential method uses something in between. This seems logical, for the predictor moves parallel to the central path toward a lower duality gap, and the corrector moves perpendicular to get close to the central path. This new method does both at once. Of course, this intuitive notion must be made precise.

For $\rho \ge \sqrt{n}$, there is in fact a guaranteed decrease in the potential function by a fixed amount $\delta$ (see Exercises 12 and 13). Specifically,

$$\psi_{n+\rho}(x^+, s^+) - \psi_{n+\rho}(x, s) \le -\delta \qquad (5.16)$$

for a constant $\delta \ge 0.2$. This result provides a theoretical bound on the number of required iterations and the bound is competitive with other methods. However, a faster algorithm may be achieved by conducting a line search along direction $d$ to achieve the greatest reduction in the primal-dual potential function at each iteration. We outline the algorithm here:

Step 1. Start at a point $(x_0, y_0, s_0) \in \mathring{\mathcal{F}}$ with $\psi_{n+\rho}(x_0, s_0) \le \rho\log((s_0)^T x_0) + n\log n + O(\sqrt{n}\log n)$ which is determined by an initiation procedure, as discussed in Sect. 5.7. Set $\rho \ge \sqrt{n}$. Set $k = 0$ and $\gamma = n/(n + \rho)$. Select an accuracy parameter $\varepsilon > 0$.

Step 2. Set $(x, s) = (x_k, s_k)$ and compute $(d_x, d_y, d_s)$ from (5.13).

Step 3. Let $x_{k+1} = x_k + \bar\alpha d_x$, $y_{k+1} = y_k + \bar\alpha d_y$, and $s_{k+1} = s_k + \bar\alpha d_s$ where

$$\bar\alpha = \arg\min_{\alpha \ge 0}\ \psi_{n+\rho}(x_k + \alpha d_x,\ s_k + \alpha d_s).$$

Step 4. Let $k = k + 1$. If $\dfrac{s_k^T x_k}{s_0^T x_0} \le \varepsilon$, stop. Otherwise return to Step 2.

Theorem 2. The algorithm above terminates in at most $O(\rho\log(n/\varepsilon))$ iterations with

$$\frac{(s_k)^T x_k}{(s_0)^T x_0} \le \varepsilon.$$

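Proof. Note that after $k$ iterations, we have from (5.16)

$$\psi_{n+\rho}(x_k, s_k) \le \psi_{n+\rho}(x_0, s_0) - k\cdot\delta \le \rho\log((s_0)^T x_0) + n\log n + O(\sqrt{n}\log n) - k\cdot\delta.$$

Thus, from the inequality (5.15),

$$\rho\log(s_k^T x_k) + n\log n \le \rho\log(s_0^T x_0) + n\log n + O(\sqrt{n}\log n) - k\cdot\delta,$$

or

$$\rho\left(\log(s_k^T x_k) - \log(s_0^T x_0)\right) \le -k\cdot\delta + O(\sqrt{n}\log n).$$

Therefore, as soon as $k \ge O(\rho\log(n/\varepsilon))$, we must have

$$\rho\left(\log(s_k^T x_k) - \log(s_0^T x_0)\right) \le -\rho\log(1/\varepsilon),$$

or

$$\frac{s_k^T x_k}{s_0^T x_0} \le \varepsilon.$$

Theorem 2 holds for any $\rho \ge \sqrt{n}$. Thus, by choosing $\rho = \sqrt{n}$, the iteration complexity bound becomes $O(\sqrt{n}\log(n/\varepsilon))$.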
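Steps 1 through 4 of the potential reduction algorithm can be sketched in a few lines. This is a rough illustration under stated assumptions, not a tuned implementation: it assumes NumPy, takes $\rho = \sqrt{n}$, replaces the exact line search of Step 3 by a search over a fixed grid of trial step-sizes inside the positivity limit, and starts from a hypothetical problem built so that $x = \mathbf{1}$, $y = 0$, $s = \mathbf{1}$ is strictly feasible. All function names are chosen for this sketch.

```python
import numpy as np

def newton_direction(A, x, s, gamma):
    """Solve (5.13) by eliminating d_x and d_s; b = A x is used for feasible x."""
    mu = x @ s / x.size
    d = x / s
    M = (A * d) @ A.T
    dy = np.linalg.solve(M, A @ x - gamma * mu * (A @ (1.0 / s)))
    ds = -A.T @ dy
    dx = gamma * mu / s - x - d * ds
    return dx, dy, ds

def potential(x, s, rho):
    return (x.size + rho) * np.log(x @ s) - np.sum(np.log(x * s))

def potential_reduction(A, x, y, s, eps=1e-6, max_iter=500):
    n = x.size
    rho = np.sqrt(n)                       # Step 1: rho >= sqrt(n)
    gamma = n / (n + rho)
    gap0 = x @ s
    history = [potential(x, s, rho)]
    # Trial fractions of the largest step that keeps (x, s) positive.
    fractions = np.concatenate([np.geomspace(1e-4, 0.4, 25),
                                np.linspace(0.5, 0.99, 50)])
    for _ in range(max_iter):
        if x @ s / gap0 <= eps:            # Step 4: relative gap test
            break
        dx, dy, ds = newton_direction(A, x, s, gamma)     # Step 2
        v, dv = np.concatenate([x, s]), np.concatenate([dx, ds])
        neg = dv < 0
        alpha_max = np.min(-v[neg] / dv[neg]) if neg.any() else 1.0
        alphas = alpha_max * fractions     # Step 3: crude grid line search
        values = [potential(x + a * dx, s + a * ds, rho) for a in alphas]
        best = int(np.argmin(values))
        if values[best] >= history[-1]:
            break                          # the grid found no decrease; give up
        a = alphas[best]
        x, y, s = x + a * dx, y + a * dy, s + a * ds
        history.append(values[best])
    return x, y, s, history

rng = np.random.default_rng(0)
m, n = 3, 6
A = rng.standard_normal((m, n))
x0, y0, s0 = np.ones(n), np.zeros(m), np.ones(n)
b, c = A @ x0, A.T @ y0 + s0
x, y, s, history = potential_reduction(A, x0, y0, s0)
rel_gap = (x @ s) / (x0 @ s0)
```

Because the starting point is on the central path, its potential equals $\rho\log(s_0^T x_0) + n\log n$, which satisfies the requirement of Step 1. A production code would use a proper one-dimensional minimization in Step 3 instead of a grid.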


Iteration  Complexity 

The computation of each iteration basically requires solving (5.13) for $d$. Note that the first equation of (5.13) can be written as

$$S d_x + X d_s = \gamma\mu\mathbf{1} - XS\mathbf{1},$$

where $X$ and $S$ are two diagonal matrices whose diagonal entries are components of $x > 0$ and $s > 0$, respectively. Premultiplying both sides by $S^{-1}$ we have

$$d_x + S^{-1} X d_s = \gamma\mu S^{-1}\mathbf{1} - x.$$

Then, premultiplying by $A$ and using $A d_x = 0$, we have

$$A S^{-1} X d_s = \gamma\mu A S^{-1}\mathbf{1} - Ax = \gamma\mu A S^{-1}\mathbf{1} - b.$$

Using $d_s = -A^T d_y$ we have

$$(A S^{-1} X A^T)\, d_y = b - \gamma\mu A S^{-1}\mathbf{1}.$$

Thus, the primary computational cost of each iteration of the interior-point algorithm discussed in this section is to form and invert the normal matrix $A X S^{-1} A^T$, which typically requires $O(nm^2 + m^3)$ arithmetic operations. However, an approximation of this matrix can be updated and inverted using far fewer arithmetic operations. In fact, using a rank-one technique (see Chap. 10) to update the approximate inverse of the normal matrix during the iterative progress, one can reduce the average number of arithmetic operations per iteration to $O(\sqrt{n}\, m^2)$. Thus, if the relative tolerance $\varepsilon$ is viewed as a variable, we have the following total arithmetic operation complexity bound to solve a linear program:

Corollary. Let $\rho = \sqrt{n}$. Then, the algorithm above Theorem 2 terminates in at most $O(nm^2\log(n/\varepsilon))$ arithmetic operations.
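The elimination just described translates directly into code. The sketch below, assuming NumPy, forms the normal matrix densely and solves it with a general solver for brevity; an implementation concerned with cost would typically use a Cholesky factorization and exploit sparsity. It then checks the residuals of (5.13) and the gap identity $(x^+)^T s^+ = (1 - \alpha)x^T s$ for $\gamma = 0$. The data are random and purely illustrative, with $b$ taken as $Ax$.

```python
import numpy as np

def newton_direction(A, x, s, gamma):
    """Solve (5.13) through the normal equations (A S^{-1} X A^T) d_y = b - gamma*mu*A S^{-1} 1."""
    mu = x @ s / x.size
    d = x / s                                   # diagonal of S^{-1} X
    M = (A * d) @ A.T                           # O(n m^2) to form
    b = A @ x                                   # x is assumed primal feasible
    dy = np.linalg.solve(M, b - gamma * mu * (A @ (1.0 / s)))   # O(m^3)
    ds = -A.T @ dy                              # third equation of (5.13)
    dx = gamma * mu / s - x - d * ds            # first equation, scaled by S^{-1}
    return dx, dy, ds

rng = np.random.default_rng(2)
m, n = 4, 9
A = rng.standard_normal((m, n))
x = rng.uniform(0.5, 2.0, n)
s = rng.uniform(0.5, 2.0, n)

dx, dy, ds = newton_direction(A, x, s, gamma=0.0)
residual_first = s * dx + x * ds + x * s        # should vanish when gamma = 0
residual_second = A @ dx                        # should vanish
alpha = 0.3
new_gap = (x + alpha * dx) @ (s + alpha * ds)   # equals (1 - alpha) * x^T s
```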


5.7  Termination  and  Initialization 

There are several remaining important issues concerning interior-point algorithms for linear programs. The first issue involves termination. Unlike the simplex method, which terminates with an exact solution, interior-point algorithms are continuous optimization algorithms that generate an infinite solution sequence converging to an optimal solution. If the data of a particular problem are integral or rational, an argument is made that, after the worst-case time bound, an exact solution can be rounded from the latest approximate solution. Several questions arise. First, under the real number computation model (that is, the data consists of real numbers), how can we terminate at an exact solution? Second, regardless of the data's status, is there a practical test, which can be computed cost-effectively during the iterative process, to identify an exact solution so that the algorithm can be terminated before the worst-case time bound? Here, by exact solution we mean one that could be found using exact arithmetic, such as the solution of a system of linear equations, which can be computed in a number of arithmetic operations bounded by a polynomial in $n$.

The second issue involves initialization. Almost all interior-point algorithms require the regularity assumption that $\mathring{\mathcal{F}} \ne \emptyset$. What is to be done if this is not true? A related issue is that interior-point algorithms have to start at a strictly feasible point near the central path.


*  Termination 

Complexity bounds for interior-point algorithms generally depend on an $\varepsilon$ which must be zero in order to obtain an exact optimal solution. Sometimes it is advantageous to employ an early termination or rounding method while $\varepsilon$ is still moderately large. There are five basic approaches.

•  A “purification” procedure finds a feasible corner whose objective value is at least as good as the current interior point. This can be accomplished in strongly polynomial time (that is, the complexity bound is a polynomial only in the dimensions $m$ and $n$). One difficulty is that there may be many non-optimal vertices close to the optimal face, and the procedure might require many pivot steps for difficult problems.

•  A second method seeks to identify an optimal basis. It has been shown that if the linear program is nondegenerate, the unique optimal basis may be identified early. The procedure seems to work well for some problems but it has difficulty if the problem is degenerate. Unfortunately, most real linear programs are degenerate.

•  The  third  approach  is  to  slightly  perturb  the  data  such  that  the  new  program 
is  nondegenerate  and  its  optimal  basis  remains  one  of  the  optimal  bases  of  the 
original  program.  There  are  questions  about  how  and  when  to  perturb  the  data 
during  the  iterative  process,  decisions  which  can  significantly  affect  the  success 
of  the  effort. 

•  The fourth approach is to guess the optimal face and find a feasible solution on that face. It consists of two phases: the first phase uses interior point algorithms to identify the complementarity partition $(P^*, Z^*)$ (see Exercise 6), and the second phase adapts the simplex method to find an optimal primal (or dual) basic solution, using $(P^*, Z^*)$ as a starting basis. This method is often called the cross-over method. It is guaranteed to work in finite time and is implemented in several popular linear programming software packages.

•  The  fifth  approach  is  to  guess  the  optimal  face  and  project  the  current  interior 
point  onto  the  interior  of  the  optimal  face.  See  Fig.  5.4.  The  termination  criterion 
is  guaranteed  to  work  in  finite  time. 

The  fourth  and  fifth  methods  above  are  based  on  the  fact  that  (as  observed  in  practice 
and  subsequently  proved)  many  interior-point  algorithms  for  linear  programming 
generate  solution  sequences  that  converge  to  a  strictly  complementary  solution  or 
an  interior  solution  on  the  optimal  face;  see  Exercise  8. 


Initialization 

Most  interior-point  algorithms  must  be  initiated  at  a  strictly  feasible  point.  The 
complexity  of  obtaining  such  an  initial  point  is  the  same  as  that  of  solving  the 
linear  program  itself.  More  importantly,  a  complete  algorithm  should  accomplish 
two  tasks:  (1)  detect  the  infeasibility  or  unboundedness  status  of  the  problem,  then 
(2)  generate  an  optimal  solution  if  the  problem  is  neither  infeasible  nor  unbounded. 
Several  approaches  have  been  proposed  to  accomplish  these  goals: 

•  The primal and dual can be combined into a single linear feasibility problem, and a feasible point found. Theoretically, this approach achieves the currently best iteration complexity bound, that is, $O(\sqrt{n}\log(1/\varepsilon))$. Practically, a significant disadvantage of this approach is the doubled dimension of the system of equations that must be solved at each iteration.

•  The big-M method can be used by adding one or more artificial column(s) and/or row(s) and a huge penalty parameter $M$ to force solutions to become feasible during the algorithm. A major disadvantage of this approach is the numerical problems caused by the addition of coefficients of large magnitude.

•  Phase  I-then-Phase  II  methods  are  effective.  A  major  disadvantage  of  this 
approach  is  that  the  two  (or  three)  related  linear  programs  must  be  solved 
sequentially. 


Fig.  5.4  Illustration  of  the  projection  of  an  interior  point  onto  the  optimal  face 


•  A modified Phase I-Phase II method approaches feasibility and optimality simultaneously. To our knowledge, the currently best iteration complexity bound of this approach is $O(n\log(1/\varepsilon))$, as compared to $O(\sqrt{n}\log(1/\varepsilon))$ of the three above. Other disadvantages of the method include the assumption of non-empty interior and the need of an objective lower bound.


The  HSD  Algorithm 

There is an algorithm, termed the Homogeneous Self-Dual Algorithm, that overcomes the difficulties mentioned above. The algorithm achieves the theoretically best $O(\sqrt{n}\log(1/\varepsilon))$ complexity bound and is often used in linear programming software packages.

The  algorithm  is  based  on  the  construction  of  a  homogeneous  and  self-dual  linear 
program  related  to  (LP)  and  (LD)  (see  Sect.  5.5).  We  now  briefly  explain  the  two 
major  concepts,  homogeneity  and  self-duality,  used  in  the  construction. 

In general, a system of linear equations or inequalities is homogeneous if the right hand side components are all zero. Then if a solution is found, any positive multiple of that solution is also a solution. In the construction used below, we allow a single inhomogeneous constraint, often called a normalizing constraint. Karmarkar's original canonical form is a homogeneous linear program.

A linear program is termed self-dual if the dual of the problem is equivalent to the primal. The advantage of self-duality is that we can apply a primal-dual interior-point algorithm to solve the self-dual problem without doubling the dimension of the linear system solved at each iteration.

The homogeneous and self-dual linear program (HSDP) is constructed from (LP) and (LD) in such a way that the point $x = \mathbf{1}$, $y = 0$, $\tau = 1$, $z = 1$, $\theta = 1$ is feasible. The primal program is

$$\begin{array}{llrrrrl}
\text{(HSDP)} & \text{minimize} & & & & (n+1)\theta & \\
 & \text{subject to} & & Ax & -b\tau & +\bar{b}\theta & = 0, \\
 & & -A^T y & & +c\tau & -\bar{c}\theta & \ge 0, \\
 & & b^T y & -c^T x & & +\bar{z}\theta & \ge 0, \\
 & & -\bar{b}^T y & +\bar{c}^T x & -\bar{z}\tau & & = -(n+1), \\
 & & y \text{ free}, & x \ge 0, & \tau \ge 0, & \theta \text{ free};
\end{array}$$

where

$$\bar{b} = b - A\mathbf{1}, \quad \bar{c} = c - \mathbf{1}, \quad \bar{z} = c^T\mathbf{1} + 1. \qquad (5.17)$$

Notice that $\bar{b}$, $\bar{c}$, and $\bar{z}$ represent the “infeasibility” of the initial primal point, dual point, and primal-dual “gap,” respectively. They are chosen so that the system is feasible. For example, for the point $x = \mathbf{1}$, $y = 0$, $\tau = 1$, $\theta = 1$, the last equation becomes

$$0 + c^T x - \mathbf{1}^T x - (c^T x + 1) = -n - 1.$$

Note also that the top two constraints in (HSDP), with $\tau = 1$ and $\theta = 0$, represent primal and dual feasibility (with $x \ge 0$). The third equation represents reversed weak duality (with $b^T y \ge c^T x$ rather than the usual $c^T x \ge b^T y$). So if these three equations are satisfied with $\tau = 1$ and $\theta = 0$ they define primal and dual optimal solutions. Then, to achieve primal and dual feasibility for $x = \mathbf{1}$, $(y, s) = (0, \mathbf{1})$, we add the artificial variable $\theta$. The fourth constraint is added to achieve self-duality.

The problem is self-dual because its overall coefficient matrix has the property that its transpose is equal to its negative. It is skew-symmetric.

Denote by $s$ the slack vector for the second constraint and by $\kappa$ the slack scalar for the third constraint. Denote by $\mathcal{F}_h$ the set of all points $(y, x, \tau, \theta, s, \kappa)$ that are feasible for (HSDP). Denote by $\mathring{\mathcal{F}}_h$ the set of strictly feasible points with $(x, \tau, s, \kappa) > 0$ in $\mathcal{F}_h$. By combining the constraints (Exercise 14) we can write the last (equality) constraint as

$$\mathbf{1}^T x + \mathbf{1}^T s + \tau + \kappa - (n + 1)\theta = (n + 1), \qquad (5.18)$$

which serves as a normalizing constraint for (HSDP). This implies that for $0 \le \theta \le 1$ the variables in this equation are bounded.
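The structural claims about (HSDP) can be verified numerically. The sketch below, assuming NumPy and random illustrative data, assembles the coefficient matrix with the variables ordered as $(y, x, \tau, \theta)$, and then checks skew-symmetry, feasibility of the point $y = 0$, $x = \mathbf{1}$, $\tau = 1$, $\theta = 1$, and the normalizing constraint (5.18). The helper name `build_hsdp` is hypothetical.

```python
import numpy as np

def build_hsdp(A, b, c):
    """Coefficient matrix of (HSDP) acting on the stacked vector (y, x, tau, theta)."""
    m, n = A.shape
    one = np.ones(n)
    b_bar = b - A @ one                     # (5.17)
    c_bar = c - one
    z_bar = c @ one + 1.0
    col = lambda v: v[:, None]              # vector -> column block
    row = lambda v: v[None, :]              # vector -> row block
    z11 = np.zeros((1, 1))
    return np.block([
        [np.zeros((m, m)), A,                col(-b),                col(b_bar)],
        [-A.T,             np.zeros((n, n)), col(c),                 col(-c_bar)],
        [row(b),           row(-c),          z11,                    np.array([[z_bar]])],
        [row(-b_bar),      row(c_bar),       np.array([[-z_bar]]),   z11],
    ])

rng = np.random.default_rng(3)
m, n = 2, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
c = rng.standard_normal(n)
M = build_hsdp(A, b, c)

u = np.concatenate([np.zeros(m), np.ones(n), [1.0], [1.0]])   # (y, x, tau, theta)
r = M @ u
s_slack = r[m:m + n]            # slack of the second constraint (vector s)
kappa = r[m + n]                # slack of the third constraint
x, tau, theta = u[m:m + n], u[m + n], u[m + n + 1]
normalizer = x.sum() + s_slack.sum() + tau + kappa - (n + 1) * theta
```

At this starting point the slacks are $s = \mathbf{1}$ and $\kappa = 1$, so every complementary product equals one.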

We  state  without  proof  the  following  basic  result. 

Theorem 1. Consider problem (HSDP).

(i) (HSDP) has an optimal solution and its optimal solution set is bounded.

(ii) The optimal value of (HSDP) is zero, and $(y, x, \tau, \theta, s, \kappa) \in \mathcal{F}_h$ implies that $(n + 1)\theta = x^T s + \tau\kappa$.

(iii) There is an optimal solution $(y^*, x^*, \tau^*, \theta^* = 0, s^*, \kappa^*) \in \mathcal{F}_h$ such that

$$\begin{pmatrix} x^* + s^* \\ \tau^* + \kappa^* \end{pmatrix} > 0,$$

which we call a strictly self-complementary solution.


Part (ii) of the theorem shows that as $\theta$ goes to zero, the solution tends toward satisfying complementary slackness between $x$ and $s$ and between $\tau$ and $\kappa$. Part (iii) shows that at a solution with $\theta = 0$, the complementary slackness is strict in the sense that at least one member of a complementary pair must be positive. For example, $x_1 s_1 = 0$ is required by complementary slackness, but in this case $x_1 = 0$, $s_1 = 0$ will not occur; exactly one of them must be positive.

We  now  relate  optimal  solutions  to  (HSDP)  to  those  for  (LP)  and  (LD). 

Theorem 2. Let $(y^*, x^*, \tau^*, \theta^* = 0, s^*, \kappa^*)$ be a strictly self-complementary solution for (HSDP).

(i) (LP) has a solution (feasible and bounded) if and only if $\tau^* > 0$. In this case, $x^*/\tau^*$ is an optimal solution for (LP) and $y^*/\tau^*$, $s^*/\tau^*$ is an optimal solution for (LD).

(ii) (LP) has no solution if and only if $\kappa^* > 0$. In this case, $x^*/\kappa^*$ or $y^*/\kappa^*$ or both are certificates for proving infeasibility: if $c^T x^* < 0$ then (LD) is infeasible; if $-b^T y^* < 0$ then (LP) is infeasible; and if both $c^T x^* < 0$ and $-b^T y^* < 0$ then both (LP) and (LD) are infeasible.


Proof. We prove the second statement. We first assume that one of (LP) and (LD) is infeasible, say (LD) is infeasible. Then there is some certificate $\bar{x} \ge 0$ such that $A\bar{x} = 0$ and $c^T\bar{x} = -1$. Let $(\bar{y} = 0,\ \bar{s} = 0)$ and

$$\alpha = \frac{n + 1}{\mathbf{1}^T\bar{x} + \mathbf{1}^T\bar{s} + 1} > 0.$$

Then one can verify that

$$\tilde{y}^* = \alpha\bar{y}, \quad \tilde{x}^* = \alpha\bar{x}, \quad \tilde{\tau}^* = 0, \quad \tilde{\theta}^* = 0, \quad \tilde{s}^* = \alpha\bar{s}, \quad \tilde{\kappa}^* = \alpha$$

is a self-complementary solution for (HSDP). Since the supporting set (the set of positive entries) of a strictly complementary solution for (HSDP) is unique (see Exercise 6), $\kappa^* > 0$ at any strictly complementary solution for (HSDP).

Conversely, if $\tau^* = 0$, then $\kappa^* > 0$, which implies that $c^T x^* - b^T y^* < 0$, i.e., at least one of $c^T x^*$ and $-b^T y^*$ is strictly less than zero. Let us say $c^T x^* < 0$. In addition, we have

$$Ax^* = 0, \quad A^T y^* + s^* = 0, \quad (x^*)^T s^* = 0 \quad\text{and}\quad x^* + s^* > 0.$$

From Farkas' lemma (Exercise 5), $x^*/\kappa^*$ is a certificate for proving dual infeasibility. The other cases hold similarly. ∎
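The case analysis of Theorem 2 can be written as a small decision routine. The sketch below assumes NumPy and that a strictly self-complementary solution of (HSDP) has already been computed by some other means; the function name, the tolerance, and the dictionary layout are all choices made for this illustration. The small example uses the hypothetical data $A = (1, -1)$, $b = 1$, $c = (-1, -1)$, for which (LD) is infeasible and the certificate from the proof with $\bar{x} = (1/2, 1/2)$ gives $\alpha = 3/2$.

```python
import numpy as np

def interpret_hsd_solution(x, y, s, tau, kappa, b, c, tol=1e-9):
    """Map a strictly self-complementary (HSDP) solution to an LP conclusion."""
    if tau > tol:
        # Theorem 2(i): rescale by tau to obtain optimal solutions of (LP) and (LD).
        return {"status": "optimal", "x": x / tau, "y": y / tau, "s": s / tau}
    if kappa > tol:
        # Theorem 2(ii): read off which problem the certificate rules out.
        return {"status": "infeasible",
                "dual_infeasible": bool(c @ x < -tol),
                "primal_infeasible": bool(-(b @ y) < -tol)}
    # Neither tau nor kappa is positive within tol: not strictly self-complementary.
    return {"status": "undetermined"}

b = np.array([1.0])
c = np.array([-1.0, -1.0])

# Certificate from the proof: x = alpha * x_bar, kappa = alpha, tau = 0.
alpha = 3.0 / (0.5 + 0.5 + 0.0 + 1.0)
cert = interpret_hsd_solution(x=alpha * np.array([0.5, 0.5]), y=np.zeros(1),
                              s=np.zeros(2), tau=0.0, kappa=alpha, b=b, c=c)

# A made-up solution with tau = 2, only to show the rescaling in case (i).
scaled = interpret_hsd_solution(x=np.array([2.0, 4.0]), y=np.array([6.0]),
                                s=np.array([0.0, 0.0]), tau=2.0, kappa=0.0,
                                b=b, c=c)
```

In floating-point practice the iterates only approach such a solution, so the comparison of $\tau$ and $\kappa$ against a tolerance is a heuristic rather than a proof.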




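To solve (HSDP), we have the following theorem that resembles the central path analyzed for (LP) and (LD).

Theorem 3. Consider problem (HSDP). For any $\mu > 0$, there is a unique $(y, x, \tau, \theta, s, \kappa)$ in $\mathring{\mathcal{F}}_h$, such that

$$\begin{pmatrix} x \circ s \\ \tau\kappa \end{pmatrix} = \mu\mathbf{1}.$$

Moreover, $(x, \tau) = (\mathbf{1}, 1)$, $(y, s, \kappa) = (0, \mathbf{1}, 1)$ and $\theta = 1$ is the solution with $\mu = 1$.

Theorem 3 defines an endogenous path associated with (HSDP):

$$\mathcal{C} = \left\{ (y, x, \tau, \theta, s, \kappa) \in \mathring{\mathcal{F}}_h : \begin{pmatrix} x \circ s \\ \tau\kappa \end{pmatrix} = \frac{x^T s + \tau\kappa}{n + 1}\mathbf{1} \right\}.$$

Furthermore, the potential function for (HSDP) can be defined as

$$\psi_{n+1+\rho}(x, \tau, s, \kappa) = (n + 1 + \rho)\log(x^T s + \tau\kappa) - \sum_{j=1}^{n}\log(x_j s_j) - \log(\tau\kappa), \qquad (5.19)$$

where $\rho \ge 0$. One can then apply the interior-point algorithms described earlier to solve (HSDP) from the initial point $(x, \tau) = (\mathbf{1}, 1)$, $(y, s, \kappa) = (0, \mathbf{1}, 1)$ and $\theta = 1$ with $\mu = (x^T s + \tau\kappa)/(n + 1) = 1$.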

The  HSDP  method  outlined  above  enjoys  the  following  properties: 

•  It  does  not  require  regularity  assumptions  concerning  the  existence  of  optimal, 
feasible,  or  interior  feasible  solutions. 

•  It  can  be  initiated  at  x  =  1,  y  =  0  and  s  =  1,  feasible  or  infeasible,  on  the 
central  ray  of  the  positive  orthant  (cone),  and  it  does  not  require  a  big-M  penalty 
parameter  or  lower  bound. 

•  Each  iteration  solves  a  system  of  linear  equations  whose  dimension  is  almost  the 
same  as  that  used  in  the  standard  (primal-dual)  interior-point  algorithms. 

•  If  the  linear  program  has  a  solution,  the  algorithm  generates  a  sequence  that 
approaches  feasibility  and  optimality  simultaneously;  if  the  problem  is  infeasible 
or  unbounded,  the  algorithm  produces  an  infeasibility  certificate  for  at  least  one 
of  the  primal  and  dual  problems;  see  Exercise  5. 


5.8  Summary 

The simplex method has for decades been an efficient method for solving linear programs, despite the fact that there are no theoretical results to support its efficiency. Indeed, it was shown that in the worst case, the method may visit every vertex of the feasible region and this can be exponential in the number of variables and constraints. If on practical problems the simplex method behaved according to the worst case, even modest problems would require years of computer time to solve. The ellipsoid method was the first method that was proved to converge in time proportional to a polynomial in the size of the program, rather than to an exponential in the size. However, in practice, it was disappointingly slower than the simplex method. Later, the interior-point method of Karmarkar significantly advanced the field of linear programming, for it not only was proved to be a polynomial-time method, but it was found in practice to be faster than the simplex method when applied to general linear programs.

The interior-point method is based on introducing a logarithmic barrier function with a weighting parameter $\mu$; and now there is a general theoretical structure defining the analytic center, the central path of solutions as $\mu \to 0$, and the duals of these concepts. This structure is useful for specifying and analyzing various versions of interior point methods.

Most methods employ a step of Newton's method to find a point near the central path when moving from one value of $\mu$ to another. One approach is the predictor-corrector method, which first takes a step in the direction of decreasing $\mu$ and then a corrector step to get closer to the central path. Another method employs a potential function whose value can be decreased at each step, which guarantees convergence and assures that intermediate points simultaneously make progress toward the solution while remaining close to the central path.

Complete algorithms based on these approaches require a number of other features and details. For example, once systematic movement toward the solution is terminated, a final phase may move to a nearby vertex or to a non-vertex point on a face of the constraint set. Also, an initial phase must be employed to obtain a feasible point that is close to the central path from which the steps of the search algorithm can be started. These features are incorporated into several commercial software packages, and generally they perform well, solving very large linear programs in reasonable time.


5.9  Exercises 

1.  Using  the  simplex  method,  solve  the  program  (5.1)  and  count  the  number  of 
pivots  required. 

2.  Prove  the  volume  reduction  rate  in  Theorem  1  for  the  ellipsoid  method. 

3. Develop a cutting plane method, based on the ellipsoid method, to find a point satisfying convex inequalities

$$f_i(x) \le 0, \quad i = 1, \ldots, m, \qquad |x|^2 \le R^2,$$

where the $f_i$'s are convex functions of $x$ in $C^1$.

4. Consider the linear program (5.5) and assume that $\mathring{\mathcal{F}}_p = \{x : Ax = b,\ x > 0\}$ is nonempty and its optimal solution set is bounded. Show that the dual of the problem has a nonempty interior.

5. (Farkas' lemma) Prove: Exactly one of the feasible sets $\{x : Ax = b,\ x \ge 0\}$ and $\{y : y^T A \le 0,\ y^T b = 1\}$ is nonempty. A vector $y$ in the latter set is called an infeasibility certificate for the former.

6. (Strict complementarity) Consider any linear program in standard form and its dual and let both of them be feasible. Then, there always exists a strictly complementary solution pair, $(x^*, y^*, s^*)$, such that

$$x_j^* s_j^* = 0 \quad\text{and}\quad x_j^* + s_j^* > 0 \quad\text{for all } j.$$

Moreover, the supports of $x^*$ and $s^*$, $P^* = \{j : x_j^* > 0\}$ and $Z^* = \{j : s_j^* > 0\}$, are invariant among all strictly complementary solution pairs.

7. (Central path theorem) Let $(x(\mu), y(\mu), s(\mu))$ be the central path of (5.9). Then prove

(a) The central path point $(x(\mu), y(\mu), s(\mu))$ is bounded for $0 < \mu \le \mu^0$ and any given $0 < \mu^0 < \infty$.

(b) For $0 < \mu' < \mu$,

$$c^T x(\mu') \le c^T x(\mu) \quad\text{and}\quad b^T y(\mu') \ge b^T y(\mu).$$

Furthermore, if $x(\mu') \ne x(\mu)$ and $y(\mu') \ne y(\mu)$,

$$c^T x(\mu') < c^T x(\mu) \quad\text{and}\quad b^T y(\mu') > b^T y(\mu).$$

(c) $(x(\mu), y(\mu), s(\mu))$ converges to an optimal solution pair for (LP) and (LD). Moreover, the limit point $x(0)_{P^*}$ is the analytic center on the primal optimal face, and the limit point $s(0)_{Z^*}$ is the analytic center on the dual optimal face, where $(P^*, Z^*)$ is the strict complementarity partition of the index set $\{1, 2, \ldots, n\}$.

8. Consider a primal-dual interior point $(x, y, s) \in \mathcal{N}(\eta)$ where $\eta < 1$. Prove that there is a fixed quantity $\delta > 0$ such that

$$x_j \ge \delta, \quad\text{for all } j \in P^*$$

and

$$s_j \ge \delta, \quad\text{for all } j \in Z^*,$$

where $(P^*, Z^*)$ is defined in Exercise 6.

9. (Potential level theorem) Define the potential level set

$$\Psi(\delta) := \{(x, y, s) \in \mathring{\mathcal{F}} : \psi_{n+\rho}(x, s) \le \delta\}.$$

Prove

(a) $\Psi(\delta^1) \subset \Psi(\delta^2)$ if $\delta^1 \le \delta^2$.

(b) For every $\delta$, $\Psi(\delta)$ is bounded and its closure $\bar{\Psi}(\delta)$ has non-empty intersection with the solution set.

10. Given $0 < x$, $0 < s \in E^n$, show that

$$n\log(x^T s) - \sum_{j=1}^{n}\log(x_j s_j) \ge n\log n$$

and

$$x^T s \le \exp\left(\frac{\psi_{n+\rho}(x, s) - n\log n}{\rho}\right).$$

11. (Logarithmic approximation) If $d \in E^n$ such that $|d|_\infty < 1$ then

$$\mathbf{1}^T d \ge \sum_{i=1}^{n}\log(1 + d_i) \ge \mathbf{1}^T d - \frac{|d|^2}{2(1 - |d|_\infty)}.$$

[Note: If $d = (d_1, d_2, \ldots, d_n)$ then $|d|_\infty = \max_i\{|d_i|\}$.]

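12. Let the direction $(d_x, d_y, d_s)$ be generated by system (5.13) with $\gamma = n/(n + \rho)$ and $\mu = x^T s/n$, and let the step size be

$$\alpha = \frac{\theta\sqrt{\min(Xs)}}{\left|(XS)^{-1/2}\left(\frac{x^T s}{n+\rho}\mathbf{1} - Xs\right)\right|}, \qquad (5.20)$$

where $\theta$ is a positive constant less than 1. Let

$$x^+ = x + \alpha d_x, \quad y^+ = y + \alpha d_y, \quad\text{and}\quad s^+ = s + \alpha d_s.$$

Then, using Exercise 11 and the concavity of the logarithmic function show $(x^+, y^+, s^+) \in \mathring{\mathcal{F}}$ and

$$\psi_{n+\rho}(x^+, s^+) - \psi_{n+\rho}(x, s) \le -\theta\sqrt{\min(Xs)}\left|(XS)^{-1/2}\left(\mathbf{1} - \frac{n+\rho}{x^T s}Xs\right)\right| + \frac{\theta^2}{2(1-\theta)}.$$

13. Let $v = Xs$ in Exercise 12. Prove

$$\sqrt{\min(v)}\left|V^{-1/2}\left(\mathbf{1} - \frac{n+\rho}{\mathbf{1}^T v}v\right)\right| \ge \sqrt{3/4},$$

where $V$ is the diagonal matrix of $v$. Thus, the two exercises imply

$$\psi_{n+\rho}(x^+, s^+) - \psi_{n+\rho}(x, s) \le -\theta\sqrt{3/4} + \frac{\theta^2}{2(1-\theta)} = -\delta$$

for a constant $\delta$. One can verify that $\delta > 0.2$ when $\theta = 0.4$.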

14. Prove property (5.18) for (HSDP).

15. Prove Theorem 1.




References 

5.1 Computation and complexity models were developed by a number of scientists; see, e.g., Cook [C5], Hartmanis and Stearns [H5], and Papadimitriou and Steiglitz [P2] for the bit complexity models and Blum et al. [B21] for the real number arithmetic model. For a general discussion of complexity see Vavasis [V4]. For a comprehensive treatment which served as the basis for much of this chapter, see Ye [Y3].

5.2 The Klee-Minty example is presented in [K5]. Much of this material is based on a teaching note of Cottle on Linear Programming taught at Stanford [C6]. The practical performance of the simplex method can be seen in Bixby [B18]. The simplex method efficiency for the Markov Decision Process is due to Ye [269].

5.3 The ellipsoid method was developed by Khachiyan [K4]; more developments of the ellipsoid method can be found in Bland, Goldfarb and Todd [B20].

5.4 The analytic center for a convex polyhedron given by linear inequalities was introduced by Huard [H12], and later by Sonnevend [S8]. The barrier function was introduced by Frisch [F19]. The central path was analyzed in McLinden [M3], Megiddo [M4], Bayer and Lagarias [B3, B4], and Gill et al. [G5].

5.5 Path-following algorithms were first developed by Renegar [R1]. A primal barrier or path-following algorithm was independently analyzed by Gonzaga [G13]. Both Gonzaga [G13] and Vaidya [V1] extended the rank-one updating technique [K2] for solving the Newton equation of each iteration, and proved that each iteration uses $O(n^{2.5})$ arithmetic operations on average. Kojima, Mizuno and Yoshise [K6] and Monteiro and Adler [M7] developed a symmetric primal-dual path-following algorithm with the same iteration and arithmetic operation bounds.

5.6-5.7 Predictor-corrector algorithms were developed by Mizuno et al. [M6]. A more practical predictor-corrector algorithm was proposed by Mehrotra [M5] (also see Lustig et al. [L19] and Zhang and Zhang [Z3]). Mehrotra's technique has been used in almost all linear programming interior-point implementations. A primal potential reduction algorithm was initially proposed by Karmarkar [K2]. The primal-dual potential function was proposed by Tanabe [T2] and Todd and Ye [T5]. The primal-dual potential reduction algorithm was developed by Ye [Y1], Freund [F18], Kojima, Mizuno and Yoshise [K7], Goldfarb and Xiao [G11], Gonzaga and Todd [G14], Todd [T4], Tunçel [T10], Tutuncu [T11], and others. The homogeneous and self-dual embedding method can be found in Ye et al. [Y2], Luo et al. [L18], Andersen and Ye [A5], and many others. It is also implemented in most linear programming software packages such as SEDUMI of Sturm [S11].




5.1-5.7 There are several comprehensive textbooks which cover interior-point linear programming algorithms. They include Bazaraa, Jarvis and Sherali [B6], Bertsekas [B12], Bertsimas and Tsitsiklis [B13], Cottle [C6], Cottle, Pang and Stone [C7], Dantzig and Thapa [D9, D10], Fang and Puthenpura [F2], den Hertog [H6], Murty [M12], Nash and Sofer [N1], Nesterov [N2], Roos et al. [R4], Renegar [R2], Saigal [S1], Vanderbei [V3], and Wright [W8].


Chapter  6 

Conic  Linear  Programming 


6.1  Convex  Cones 

Conic Linear Programming, hereafter CLP, is a natural extension of linear programming (LP). In LP, the variables form a vector which is required to be componentwise nonnegative, while in CLP they are points in a pointed convex cone (see Appendix B.1) of a Euclidean space, such as vectors as well as matrices of finite dimensions. For example, semidefinite programming (SDP) is a kind of CLP, where the variable points are symmetric matrices constrained to be positive semidefinite. Both types of problems may have linear equality constraints as well. Although CLPs have long been known to be convex optimization problems, no efficient solution algorithm was known until about two decades ago, when it was discovered that the interior-point algorithms for LP discussed in Chap. 5 can be adapted to solve certain CLPs with both theoretical and practical efficiency. During the same period, it was discovered that CLP, especially SDP, is representative of a wide assortment of applications, including combinatorial optimization, statistical computation, robust optimization, Euclidean distance geometry, quantum computing, and optimal control. CLP is now widely recognized as a powerful mathematical computation model of general importance.

First, we illustrate several convex cones popularly used in conic linear optimization.

Example 1. The following are all (closed) convex cones.

• The $n$-dimensional nonnegative orthant, $E^n_+ = \{x \in E^n : x \ge 0\}$, is a convex cone.

• The set of all $n$-dimensional symmetric positive semidefinite matrices, denoted by $S^n_+$, is a convex cone, called the positive semidefinite matrix cone. When $X$ is positive semidefinite (positive definite), we often write the property as $X \succeq 0$ ($X \succ 0$).

• The set $\{(u; x) \in E^{n+1} : u \ge |x|_p\}$ is a convex cone in $E^{n+1}$, called the $p$-order cone, where $1 \le p < \infty$. When $p = 2$, the cone is called the second-order cone or "ice-cream" cone.

Sometimes, we use the notion of conic inequalities $P \succeq_K Q$ or $Q \preceq_K P$, in which cases we simply mean $P - Q \in K$.

Suppose $A$ and $B$ are $k \times n$ matrices. We define the inner product

$$A \bullet B = \mathrm{trace}(A^T B) = \sum_{i,j} a_{ij} b_{ij}.$$

When $k = 1$, they become $n$-dimensional vectors and the inner product is the standard dot product of two vectors. In SDP, this definition is almost always used for the case where the matrices are both square and symmetric. The matrix norm associated with the inner product is called the Frobenius norm:

$$|X|_f = \sqrt{X \bullet X}.$$
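As a small illustration of the matrix inner product, the following sketch computes it both as a trace and as an entrywise sum and confirms that the two agree. The matrices are made-up data, and the helper names (`inner_via_trace`, `inner_entrywise`, `frobenius_norm`) are illustrative, not notation from the text.

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

def inner_via_trace(A, B):
    """Inner product computed as trace(A^T B)."""
    return trace(matmul(transpose(A), B))

def inner_entrywise(A, B):
    """Inner product computed as the sum of a_ij * b_ij over all entries."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def frobenius_norm(X):
    """Norm induced by the inner product: sqrt(X . X)."""
    return math.sqrt(inner_entrywise(X, X))

# Illustrative 2 x 3 matrices (made-up data).
A = [[1, 2, 0], [-1, 3, 4]]
B = [[2, 0, 1], [5, -2, 1]]

print(inner_via_trace(A, B), inner_entrywise(A, B))  # the two definitions agree
print(frobenius_norm(A) ** 2)                         # squared norm equals A . A

# With k = 1 the matrices are row vectors and the inner product is the dot product.
u, v = [[1, 2, 3]], [[4, 5, 6]]
print(inner_entrywise(u, v))
```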

For a cone $K$, the dual of $K$ is the cone

$$K^* := \{Y : X \bullet Y \ge 0 \text{ for all } X \in K\}.$$

It is not difficult to see that each of the first two cones in Example 1 is its own dual, while the dual cone of the $p$-order cone is the $q$-order cone where

$$\frac{1}{p} + \frac{1}{q} = 1.$$

One can see that when $p = 2$, $q = 2$ as well; that is, they are both second-order cones. For a closed convex cone $K$, the dual of the dual cone is itself.
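The duality between the $p$-order and $q$-order cones is Hölder's inequality in disguise. The sketch below is a numerical sanity check, not a proof: it samples points from both cones (with the arbitrary choice $p = 3$), confirms that their inner products are nonnegative, and then builds a point just outside the $q$-order cone together with a $p$-order cone point that has a negative inner product with it. All data and helper names are illustrative.

```python
import math
import random

def p_norm(x, p):
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

def in_order_cone(point, p, tol=1e-12):
    """Membership test for {(u; x) : u >= |x|_p}; point[0] plays the role of u."""
    return point[0] >= p_norm(point[1:], p) - tol

def inner(a, b):
    return sum(s * t for s, t in zip(a, b))

p = 3.0
q = p / (p - 1.0)  # conjugate exponent, so that 1/p + 1/q = 1

rng = random.Random(0)
worst = float("inf")
for _ in range(2000):
    x = [rng.uniform(-1, 1) for _ in range(4)]
    y = [rng.uniform(-1, 1) for _ in range(4)]
    X = [p_norm(x, p) + rng.uniform(0, 1)] + x  # a point of the p-order cone
    Y = [p_norm(y, q) + rng.uniform(0, 1)] + y  # a point of the q-order cone
    worst = min(worst, inner(X, Y))

# A point just outside the q-order cone ...
y = [0.5, -1.0, 2.0, 0.25]
Y_out = [0.9 * p_norm(y, q)] + y
# ... and a boundary point of the p-order cone that exposes it (Hoelder equality case).
x = [-math.copysign(1.0, t) * abs(t) ** (q - 1.0) for t in y]
X_wit = [p_norm(x, p)] + x
violation = inner(X_wit, Y_out)

print("smallest sampled inner product:", worst)
print("inner product with the outside point:", violation)
```

The first number stays nonnegative, while the second is negative, consistent with the $q$-order cone being exactly the dual.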


6.2  Conic  Linear  Programming  Problem 


Now let $C$ and $A_i$, $i = 1, 2, \ldots, m$, be given matrices of $E^{k \times n}$, $b \in E^m$, and $K$ be a closed convex cone in $E^{k \times n}$. And let $X$ be an unknown matrix of $E^{k \times n}$. Then, the standard form (primal) conic linear programming problem is

$$\text{(CLP)} \quad \begin{array}{ll} \text{minimize} & C \bullet X \\ \text{subject to} & A_i \bullet X = b_i, \ i = 1, 2, \ldots, m, \quad X \in K. \end{array} \qquad (6.1)$$

Note that in CLP we minimize a linear function of the decision matrix constrained in cone $K$ and subject to linear equality constraints.

For convenience, we define an operator from a symmetric matrix to a vector:

$$\mathcal{A}X = \begin{pmatrix} A_1 \bullet X \\ A_2 \bullet X \\ \vdots \\ A_m \bullet X \end{pmatrix}. \qquad (6.2)$$

Then, CLP can be written in a compact form:

$$\text{(CLP)} \quad \begin{array}{ll} \text{minimize} & C \bullet X \\ \text{subject to} & \mathcal{A}X = b, \quad X \in K. \end{array}$$

When cone $K$ is the nonnegative orthant $E^n_+$, CLP reduces to linear programming (LP) in the standard form, where $\mathcal{A}$ becomes the constraint matrix $A$. When $K$ is the positive semidefinite cone $S^n_+$, CLP is called semidefinite programming (SDP); and when $K$ is the $p$-order cone, it is called $p$-order cone programming. In particular, when $p = 2$, the model is called second-order cone programming (SOCP). Frequently, we write variable $X$ in (CLP) as $x$ if it is indeed a vector, such as when $K$ is the nonnegative orthant or $p$-order cone.

One can see that the problem (SDP) (that is, (6.1) with the semidefinite cone) generalizes classical linear programming in standard form:

$$\begin{array}{ll} \text{minimize} & c^T x \\ \text{subject to} & Ax = b, \quad x \ge 0. \end{array}$$

Define $C = \mathrm{Diag}[c_1, c_2, \ldots, c_n]$, and let $A_i = \mathrm{Diag}[a_{i1}, a_{i2}, \ldots, a_{in}]$ for $i = 1, 2, \ldots, m$. The unknown is the $n \times n$ symmetric matrix $X$, which is constrained by $X \succeq 0$. Since $C \bullet X$ and $A_i \bullet X$ depend only on the diagonal elements of $X$, we may restrict the solutions $X$ to diagonal matrices. It follows that in this case the SDP problem is equivalent to a linear program, since a diagonal matrix is positive semidefinite if and only if all of its diagonal elements are nonnegative.
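The reduction from SDP with diagonal data to LP rests on one observation: when the data matrices are diagonal, the inner product only sees the diagonal of the variable. A minimal sketch with made-up numbers (the vectors `c`, `a1` and the matrix `X` are illustrative assumptions):

```python
import math

def diag(v):
    """Diagonal matrix Diag[v_1, ..., v_n]."""
    return [[v[i] if i == j else 0.0 for j in range(len(v))] for i in range(len(v))]

def inner(A, B):
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

c = [2.0, 1.0, 1.0]    # LP cost vector
a1 = [1.0, 1.0, 1.0]   # single LP constraint row

# A symmetric matrix with nonzero off-diagonal entries.
X = [[0.2, 0.1, -0.3],
     [0.1, 0.5, 0.0],
     [-0.3, 0.0, 0.3]]
x = [X[i][i] for i in range(3)]  # its diagonal, read as an LP variable

# Off-diagonal entries of X are multiplied by zeros, so only the diagonal matters.
objective_matches = math.isclose(inner(diag(c), X), dot(c, x))
constraint_matches = math.isclose(inner(diag(a1), X), dot(a1, x))
print(objective_matches, constraint_matches)
```

Replacing $X$ by the diagonal matrix built from `x` therefore changes neither the objective nor the constraint values.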

One  can  further  see  the  role  of  cones  in  the  following  examples. 

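Example 1. Consider the following optimization problems with three variables.

• This is a linear programming problem in standard form:

$$\begin{array}{ll} \text{minimize} & 2x_1 + x_2 + x_3 \\ \text{subject to} & x_1 + x_2 + x_3 = 1, \\ & (x_1; x_2; x_3) \ge 0. \end{array}$$

• This is a semidefinite programming problem where the dimension of the matrix is two:

$$\begin{array}{ll} \text{minimize} & 2x_1 + x_2 + x_3 \\ \text{subject to} & x_1 + x_2 + x_3 = 1, \\ & \begin{pmatrix} x_1 & x_2 \\ x_2 & x_3 \end{pmatrix} \succeq 0. \end{array}$$

Let

$$C = \begin{pmatrix} 2 & .5 \\ .5 & 1 \end{pmatrix} \quad \text{and} \quad A_1 = \begin{pmatrix} 1 & .5 \\ .5 & 1 \end{pmatrix}.$$

Then, the problem can be written in a standard SDP form

$$\begin{array}{ll} \text{minimize} & C \bullet X \\ \text{subject to} & A_1 \bullet X = 1, \quad X \in S^2_+. \end{array}$$

• This is a second-order cone programming problem:

$$\begin{array}{ll} \text{minimize} & 2x_1 + x_2 + x_3 \\ \text{subject to} & x_1 + x_2 + x_3 = 1, \\ & \sqrt{x_2^2 + x_3^2} \le x_1. \end{array}$$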

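We present several application examples to illustrate the flexibility of this formulation.

Example 2 (Binary Quadratic Optimization). Consider a binary quadratic maximization problem

$$\begin{array}{ll} \text{maximize} & x^T Q x + 2 c^T x \\ \text{subject to} & x_j \in \{1, -1\}, \ \text{for all } j = 1, \ldots, n, \end{array}$$

which is a difficult nonconvex optimization problem. The problem can be rewritten as

$$\begin{array}{lll} z^* \equiv & \text{maximize} & \begin{pmatrix} x \\ 1 \end{pmatrix}^T \begin{pmatrix} Q & c \\ c^T & 0 \end{pmatrix} \begin{pmatrix} x \\ 1 \end{pmatrix} \\ & \text{subject to} & (x_j)^2 = 1, \ \text{for all } j = 1, \ldots, n, \end{array}$$

which can be also written as a homogeneous quadratic binary problem

$$\begin{array}{lll} z^* \equiv & \text{maximize} & \begin{pmatrix} Q & c \\ c^T & 0 \end{pmatrix} \bullet \begin{pmatrix} x \\ x_{n+1} \end{pmatrix} \begin{pmatrix} x \\ x_{n+1} \end{pmatrix}^T \\ & \text{subject to} & I_j \bullet \begin{pmatrix} x \\ x_{n+1} \end{pmatrix} \begin{pmatrix} x \\ x_{n+1} \end{pmatrix}^T = 1, \ \text{for all } j = 1, \ldots, n+1, \end{array}$$

where $I_j$ is the $(n+1) \times (n+1)$ matrix whose components are all zero except at the $j$th position on the main diagonal where it is 1. Let $(x^*; x^*_{n+1})$ be an optimal solution for the homogeneous problem. Then, one can see that $x^*/x^*_{n+1}$ would be an optimal solution to the original problem.

Since $\begin{pmatrix} x \\ x_{n+1} \end{pmatrix} \begin{pmatrix} x \\ x_{n+1} \end{pmatrix}^T$ forms a positive semidefinite matrix (with rank equal to 1), a semidefinite relaxation of the problem is defined as

$$\begin{array}{lll} z^{SDP} \equiv & \text{maximize} & \begin{pmatrix} Q & c \\ c^T & 0 \end{pmatrix} \bullet Y \\ & \text{subject to} & I_j \bullet Y = 1, \ \text{for all } j = 1, \ldots, n+1, \\ & & Y \in S^{n+1}_+, \end{array} \qquad (6.3)$$

where the symmetric matrix $Y$ has dimension $n + 1$. Obviously, $z^{SDP}$ is an upper bound of $z^*$, since the rank-1 requirement is not enforced in the relaxation.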
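The lifting step in this example can be checked by brute force on a tiny instance. The sketch below uses made-up integer data `Q` and `c` with $n = 3$, forms the bordered matrix with blocks $Q$, $c$, $c^T$, $0$, and verifies that the homogeneous problem over $\{1, -1\}^{n+1}$ has the same optimal value as the original one, and that dividing by the last coordinate recovers an optimal $x$. It enumerates all sign vectors, so it is only meant for very small $n$, and it does not solve the semidefinite relaxation itself.

```python
from itertools import product

# Hypothetical symmetric data for a 3-variable instance.
Q = [[0, 1, -2],
     [1, 0, 3],
     [-2, 3, 0]]
c = [1, -1, 2]
n = len(c)

# Bordered (n+1) x (n+1) matrix [[Q, c], [c^T, 0]].
M = [Q[i] + [c[i]] for i in range(n)] + [c + [0]]

def original_objective(x):
    quad = sum(x[i] * Q[i][j] * x[j] for i in range(n) for j in range(n))
    return quad + 2 * sum(c[i] * x[i] for i in range(n))

def lifted_objective(z):
    """Return M . (z z^T) together with the rank-one matrix Y = z z^T."""
    Y = [[zi * zj for zj in z] for zi in z]
    value = sum(M[i][j] * Y[i][j] for i in range(n + 1) for j in range(n + 1))
    return value, Y

signs = (1, -1)
z_star = max(original_objective(x) for x in product(signs, repeat=n))
z_hom, best_z = max(
    ((lifted_objective(z)[0], z) for z in product(signs, repeat=n + 1)),
    key=lambda pair: pair[0],
)

# Dividing by the last coordinate (which is +1 or -1) equals multiplying by it.
recovered = tuple(zj * best_z[-1] for zj in best_z[:-1])

# Every lifted matrix has unit diagonal, i.e. I_j . Y = 1 for all j.
unit_diagonal = all(
    lifted_objective(z)[1][j][j] == 1
    for z in product(signs, repeat=n + 1)
    for j in range(n + 1)
)
print(z_star, z_hom, recovered, unit_diagonal)
```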

Let us see how to use the relaxation. For simplicity, assuming $z^* > 0$, it has been shown that in many cases of this problem an optimal SDP solution either constitutes an exact solution or can be rounded to a good approximate solution of the original problem. In the former case, one can show that a rank-1 optimal solution matrix $Y$ exists for the semidefinite relaxation and it can be found by using a rank-reduction procedure. For the latter case, one can, using a randomized rank-reduction procedure or the principal components of $Y$, find a rank-1 feasible solution matrix $\hat{Y}$ such that

$$\begin{pmatrix} Q & c \\ c^T & 0 \end{pmatrix} \bullet \hat{Y} \ge \alpha \cdot z^{SDP} \ge \alpha \cdot z^*$$

for a provable factor $0 < \alpha \le 1$. Thus, one can find a feasible solution to the original problem whose objective value is no less than a factor $\alpha$ of the true maximal objective cost.

Example 3 (Sensor Localization). This problem is that of determining the location of sensors (for example, several cell phones scattered in a building) when measurements of some of their separation Euclidean distances can be determined, but their specific locations are not known. In general, suppose there are $n$ unknown points $x_j \in E^d$, $j = 1, \ldots, n$. We consider an edge to be a path between two points, say, $i$ and $j$. There is a known subset $N_e$ of pairs (edges) $ij$ for which the separation distance $d_{ij}$ is known. For example, this distance might be determined by the signal strength or delay time between the points. Typically, in the cell phone example, $N_e$ contains those edges whose lengths are small so that there is a strong radio signal. Then, the localization problem is to find locations $x_j$, $j = 1, \ldots, n$, such that

$$|x_i - x_j|^2 = (d_{ij})^2, \ \text{for all } (i, j) \in N_e,$$

subject to possible rotation and translation. (If the locations of some of the sensors are known, these may be sufficient to determine the rotation and translation as well.)

Let $X = [x_1 \ x_2 \ \ldots \ x_n]$ be the $d \times n$ matrix to be determined. Then

$$|x_i - x_j|^2 = (e_i - e_j)^T X^T X (e_i - e_j),$$

where $e_i \in E^n$ is the vector with 1 at the $i$th position and zero everywhere else. Let $Y = X^T X$. Then the semidefinite relaxation of the localization problem is to find $Y$ such that

$$\begin{array}{l} (e_i - e_j)(e_i - e_j)^T \bullet Y = (d_{ij})^2, \ \text{for all } (i, j) \in N_e, \\ Y \succeq 0. \end{array}$$

This problem is one of finding a feasible solution; the objective function is null. But if the distance measurements have noise, one can add additional variables and an error objective to minimize. For example,

$$\begin{array}{ll} \text{minimize} & \sum_{(i,j) \in N_e} |z_{ij}| \\ \text{subject to} & (e_i - e_j)(e_i - e_j)^T \bullet Y + z_{ij} = (d_{ij})^2, \ \text{for all } (i, j) \in N_e, \\ & Y \succeq 0. \end{array}$$

This problem can be converted into a conic linear program with mixed nonnegative orthant and semidefinite cones.

Under certain graph structure, an optimal SDP solution $Y$ of the formulation would be guaranteed rank-$d$ so that it constitutes an exact solution of the original problem. Also, in general $Y$ can be rounded to a good approximate solution of the original problem. For example, one can, using a randomized rank-reduction procedure or the $d$ principal components of $Y$, find a rank-$d$ solution matrix $\hat{Y}$.
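The linear constraints of the localization relaxation come from the identity that a squared distance between two points is a linear function of the Gram matrix $Y = X^T X$. The sketch below checks this on four made-up sensor positions in the plane; the positions, the edge set and the helper names are illustrative assumptions.

```python
# Hypothetical sensor positions in E^2 (the columns of X).
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
n = len(points)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def inner(A, B):
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Gram matrix Y = X^T X, i.e. Y[i][j] is the dot product of points i and j.
Y = [[dot(pi, pj) for pj in points] for pi in points]

def edge_matrix(i, j):
    """The rank-one matrix (e_i - e_j)(e_i - e_j)^T."""
    e = [0.0] * n
    e[i], e[j] = 1.0, -1.0
    return [[a * b for b in e] for a in e]

edges = [(0, 1), (0, 2), (1, 3), (0, 3)]
checks = []
for i, j in edges:
    from_gram = inner(edge_matrix(i, j), Y)  # equals Y_ii - 2*Y_ij + Y_jj
    direct = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
    checks.append((from_gram, direct))
    print((i, j), from_gram, direct)
```

In the relaxation, $Y$ is kept as the unknown and only required to be positive semidefinite, which drops the requirement that it factor as $X^T X$ with a $d \times n$ matrix $X$.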


6.3  Farkas’  Lemma  for  Conic  Linear  Programming 

We first introduce the notion of "interior" of cones.

Definition 1. We call $X$ an interior point of cone $K$ if and only if, for any point $Y \in K^*$, $Y \bullet X = 0$ implies $Y = 0$.

The set of interior points of $K$ is denoted by $\mathring{K}$.

Theorem 1. The interiors of the following convex cones are given as:

• The interior of the nonnegative orthant cone is the set of all vectors where every entry is positive.

• The interior of the positive semidefinite cone is the set of all positive definite matrices.

• The interior of the $p$-order cone is the set $\{(u; x) \in E^{n+1} : u > |x|_p\}$.

We give a sketch of the proof for the second-order cone, i.e., $p = 2$. Let $(\bar{u}; \bar{x}) \ne 0$ be any second-order cone point but $\bar{u} = |\bar{x}|$. Then, we can choose a dual cone (also the second-order cone) point $(v; y)$ such that

$$v = \alpha \bar{u}, \quad y = -\alpha \bar{x},$$

for a positive $\alpha$. Note that

$$(\bar{u}; \bar{x}) \bullet (v; y) = \alpha \bar{u}^2 - \alpha |\bar{x}|^2 = 0.$$

Then, one can let $\alpha \to \infty$ so that $(v; y)$ cannot be bounded.

Now let $(\bar{u}; \bar{x})$ be any given second-order cone point with $\bar{u} > |\bar{x}|$. We would like to prove that, for any dual cone (also the second-order cone) point $(v; y)$,

$$(\bar{u}; \bar{x}) \bullet (v; y) = 0$$

implies that $(v; y)$ is bounded. Note that

$$0 = (\bar{u}; \bar{x}) \bullet (v; y) = \bar{u} v + \bar{x} \bullet y$$

or

$$\bar{u} v = -\bar{x} \bullet y \le |\bar{x}| |y|.$$

If $v = 0$, we must have $y = 0$; otherwise,

$$\bar{u} \le |\bar{x}| |y| / v \le |\bar{x}|,$$

which contradicts $\bar{u} > |\bar{x}|$.

We leave the proof of the following proposition as an exercise.

Proposition 1. Let $X \in \mathring{K}$ and $Y \in K^*$. Then for any nonnegative constant $\kappa$, $Y \bullet X \le \kappa$ implies that $Y$ is bounded.

Let us now consider the feasible region of (CLP) (6.1):

$$\mathcal{F} := \{X : \mathcal{A}X = b, \ X \in K\};$$

where the interior of the feasible region is

$$\mathring{\mathcal{F}} := \{X : \mathcal{A}X = b, \ X \in \mathring{K}\}.$$

If $\mathcal{F}$ is empty with $K = E^n_+$, from Farkas' lemma for linear programming, a vector $y \in E^m$, with $y^T A \le 0$ and $y^T b > 0$, always exists and is called an infeasibility certificate for the system $\{x : Ax = b, \ x \ge 0\}$.

Does this alternative relation hold for $K$ being a general closed convex cone? Let us make the question rigorous. Let us define the reverse operator of (6.2) from a vector to a matrix:

$$y^T \mathcal{A} = \sum_{i=1}^{m} y_i A_i. \qquad (6.4)$$

Note that, by the definition, for any matrix $X \in E^{k \times n}$,

$$y^T \mathcal{A} \bullet X = y^T (\mathcal{A}X),$$

that is, the association property holds. Also, $(y^T \mathcal{A})^T = \mathcal{A}^T y$, that is, the transpose operation applies here as well.

Then, the question becomes: when $\mathcal{F}$ is empty, does there exist a vector $y \in E^m$ such that $-y^T \mathcal{A} \in K^*$ and $y^T b > 0$? Similarly, one can ask: when set $\{y : C - y^T \mathcal{A} \in K\}$ is empty, does there exist a matrix $X \in K^*$ such that $\mathcal{A}X = 0$ and $C \bullet X < 0$? Note that the answer to the second question is also "yes" when $K = E^n_+$.

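Example 1. The answer to either question is "not necessarily"; see the examples below.

• For the first question, consider $K = S^2_+$ and

$$A_1 = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \quad A_2 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad \text{and} \quad b = \begin{pmatrix} 0 \\ 2 \end{pmatrix}.$$

• For the second question, consider $K = S^2_+$ and

$$C = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \quad \text{and} \quad A_1 = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}.$$

However, if the data set $\mathcal{A}$ satisfies additional conditions, the answer would be "yes"; see the theorem below.

Theorem 2 (Farkas' Lemma for CLP). We have

• Consider set

$$\mathcal{F}_p := \{X : \mathcal{A}X = b, \ X \in K\}.$$

Suppose that there exists a vector $\bar{y}$ such that $-\bar{y}^T \mathcal{A} \in \mathring{K}^*$. Then,

1. Set $C := \{\mathcal{A}X \in E^m : X \in K\}$ is a closed convex set;

2. $\mathcal{F}_p$ has a (feasible) solution if and only if set $\{y : -y^T \mathcal{A} \in K^*, \ y^T b > 0\}$ has no feasible solution.

• Consider set

$$\mathcal{F}_d := \{y : C - y^T \mathcal{A} \in K\}.$$

Suppose that there exists a vector $\bar{X} \in \mathring{K}^*$ such that $\mathcal{A}\bar{X} = 0$. Then,

1. Set $C := \{S - y^T \mathcal{A} : S \in K\}$ is a closed convex set;

2. $\mathcal{F}_d$ has a (feasible) solution if and only if set $\{X : \mathcal{A}X = 0, \ X \in K^*, \ C \bullet X < 0\}$ has no feasible solution.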

Proof. We prove the first statement of the theorem, starting with its first part. It is clear that $C$ is a convex set. To prove that $C$ is a closed set, we need to show that if $y^k := \mathcal{A}X^k \in E^m$ for $X^k \in K$, $k = 1, \ldots$, converges to a vector $\hat{y}$, then $\hat{y} \in C$, that is, there is $\hat{X} \in K$ such that $\hat{y} := \mathcal{A}\hat{X}$. Without loss of generality, we assume that $y^k$ is a bounded sequence. Then, we have, for a positive constant $c$,

$$c \ge -\bar{y}^T y^k = -\bar{y}^T (\mathcal{A}X^k) = -\bar{y}^T \mathcal{A} \bullet X^k, \quad \forall k.$$

Since $-\bar{y}^T \mathcal{A} \in \mathring{K}^*$, by definition, the sequence of $X^k$ is also bounded. Then there is at least one accumulation point $\hat{X} \in K$ because $K$ is a closed cone. Thus, we must have $\hat{y} := \mathcal{A}\hat{X}$.

We now prove the second part. Let $\mathcal{F}_p$ have a feasible solution $\bar{X}$. Then, for any $y$ with $-y^T \mathcal{A} \in K^*$, we have

$$-y^T b = -y^T (\mathcal{A}\bar{X}) = -y^T \mathcal{A} \bullet \bar{X} \ge 0.$$

Thus, it must be true that $y^T b \le 0$, that is, $\{y : -y^T \mathcal{A} \in K^*, \ y^T b > 0\}$ must be empty.

On the other hand, let $\mathcal{F}_p$ have no feasible solution, or equivalently, $b \notin C$. We now show that $\{y : -y^T \mathcal{A} \in K^*, \ y^T b > 0\}$ must be nonempty.

Since $C$ is a closed convex set, from the separating hyperplane theorem, there must exist a $\bar{y} \in E^m$ such that

$$\bar{y}^T b > \bar{y}^T y, \quad \forall y \in C,$$

or, from $y = \mathcal{A}X$, $X \in K$, we have

$$\bar{y}^T b > \bar{y}^T (\mathcal{A}X) = \bar{y}^T \mathcal{A} \bullet X, \quad \forall X \in K.$$

That is, $\bar{y}^T \mathcal{A} \bullet X$ is bounded above for all $X \in K$.

Immediately, we see $\bar{y}^T b > 0$ since $0 \in K$. Next, it must be true that $-\bar{y}^T \mathcal{A} \in K^*$. Otherwise, we must be able to find an $\bar{X} \in K$ such that $-\bar{y}^T \mathcal{A} \bullet \bar{X} < 0$ by the definition of $K$ and its dual $K^*$. For any positive constant $\alpha$ we maintain $\alpha \bar{X} \in K$ and let $\alpha$ go to $\infty$. Then, $\bar{y}^T \mathcal{A} \bullet (\alpha \bar{X})$ goes to $\infty$, contradicting the fact that $\bar{y}^T \mathcal{A} \bullet X$ is bounded above for all $X \in K$. Thus, $\bar{y}$ is a feasible solution in $\{y : -y^T \mathcal{A} \in K^*, \ y^T b > 0\}$. ∎


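Note that $C$ may not be a closed set if the interior condition of Theorem 2 is not met. Consider $A_1$, $A_2$ and $b$ in Example 1, and we have

$$C = \left\{ \begin{pmatrix} A_1 \bullet X \\ A_2 \bullet X \end{pmatrix} : X \in S^2_+ \right\}.$$

Let

$$X^k = \begin{pmatrix} \frac{1}{k} & 1 \\ 1 & k \end{pmatrix} \in S^2_+, \quad \forall k = 1, \ldots.$$

Then we see

$$y^k = \mathcal{A}X^k = \begin{pmatrix} \frac{1}{k} \\ 2 \end{pmatrix}.$$

As $k \to \infty$ we see $y^k$ converges to $b$, but $b$ is not in $C$.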
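The failure of closedness can be observed numerically. The sketch below assumes the data are $A_1$ with a single 1 in the top-left entry, $A_2$ with ones off the diagonal, $b = (0; 2)$, and $X^k$ with rows $(1/k, 1)$ and $(1, k)$; these values are reconstructed from the surrounding discussion and should be read as assumptions. It shows that each $X^k$ is positive semidefinite, that its image approaches $b$, and that no positive semidefinite matrix with a zero top-left entry can have off-diagonal entry 1.

```python
A1 = [[1.0, 0.0], [0.0, 0.0]]
A2 = [[0.0, 1.0], [1.0, 0.0]]
b = (0.0, 2.0)

def inner(A, B):
    return sum(a * c for ra, rb in zip(A, B) for a, c in zip(ra, rb))

def is_psd_2x2(X, tol=1e-12):
    """A symmetric 2 x 2 matrix is PSD iff its diagonal and determinant are nonnegative."""
    det = X[0][0] * X[1][1] - X[0][1] * X[1][0]
    return X[0][0] >= -tol and X[1][1] >= -tol and det >= -tol

def image(X):
    """The pair (A1 . X, A2 . X)."""
    return (inner(A1, X), inner(A2, X))

gaps = []
for k in (1, 10, 100, 1000, 10000):
    Xk = [[1.0 / k, 1.0], [1.0, float(k)]]
    assert is_psd_2x2(Xk)  # on the boundary of the cone: the determinant is zero
    yk = image(Xk)
    gaps.append(max(abs(yk[0] - b[0]), abs(yk[1] - b[1])))
print(gaps)  # distances to b shrink toward zero

# Reaching b exactly needs X11 = 0 and X12 = 1, which is never PSD whatever X22 is.
attained = [is_psd_2x2([[0.0, 1.0], [1.0, t]]) for t in (0.0, 1.0, 1e6)]
print(attained)
```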
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6.4  Conic  Linear  Programming  Duality 

Because conic linear programming is an extension of classical linear programming, it would seem that there is a natural dual to the primal problem, and that this dual is itself a conic linear program. This is indeed the case, and it is related to the primal in much the same way as primal and dual linear programs are related. Furthermore, the primal and dual together lead to the formation of a primal-dual solution method, which is discussed later in this chapter.

The dual of the (primal) CLP (6.1) is

$$\text{(CLD)} \quad \begin{array}{ll} \text{maximize} & y^T b \\ \text{subject to} & \sum_{i=1}^{m} y_i A_i + S = C, \quad S \in K^*. \end{array} \qquad (6.5)$$

Or, written in a compact form:

$$\text{(CLD)} \quad \begin{array}{ll} \text{maximize} & y^T b \\ \text{subject to} & y^T \mathcal{A} + S = C, \quad S \in K^*. \end{array}$$

Notice that $S$ represents a slack matrix, and hence the problem can alternatively be expressed as

$$\begin{array}{ll} \text{maximize} & y^T b \\ \text{subject to} & \sum_{i=1}^{m} y_i A_i \preceq_{K^*} C. \end{array} \qquad (6.6)$$

Recall that conic inequality $Q \preceq_K P$ means $P - Q \in K$.

Again, just like linear programming, the dual of (CLD) will be (CLP), and they form a primal and dual pair. Whichever is the primal, the other will be the dual. We will see more primal and dual relations later.

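Example 1. Here are dual problems to the three instances in Example 1 where $y$ is just a scalar.

• The dual to the linear programming instance:

$$\begin{array}{ll} \text{maximize} & y \\ \text{subject to} & y(1, 1, 1) + (s_1, s_2, s_3) = (2, 1, 1), \\ & s = (s_1, s_2, s_3) \in K^* = E^3_+. \end{array}$$

• The dual to the semidefinite programming instance:

$$\begin{array}{ll} \text{maximize} & y \\ \text{subject to} & y A_1 + S = C, \\ & S \in K^* = S^2_+, \end{array}$$

where recall

$$C = \begin{pmatrix} 2 & .5 \\ .5 & 1 \end{pmatrix} \quad \text{and} \quad A_1 = \begin{pmatrix} 1 & .5 \\ .5 & 1 \end{pmatrix}.$$

• The dual to the second-order cone instance:

$$\begin{array}{ll} \text{maximize} & y \\ \text{subject to} & y(1, 1, 1) + (s_1, s_2, s_3) = (2, 1, 1), \\ & \sqrt{s_2^2 + s_3^2} \le s_1, \ \text{or } s = (s_1, s_2, s_3) \text{ in the second-order cone.} \end{array}$$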
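For these three small instances one can exhibit primal and dual feasible points whose objective values coincide, which certifies optimality of both. The sketch below takes the common data to be: objective $2x_1 + x_2 + x_3$, constraint $x_1 + x_2 + x_3 = 1$, and for the SDP instance the matrix with rows $(x_1, x_2)$ and $(x_2, x_3)$. The candidate points were worked out by hand for this illustration and are not taken from the text.

```python
import math

TOL = 1e-9
COST = (2.0, 1.0, 1.0)

def objective(x):
    return sum(c * t for c, t in zip(COST, x))

def sums_to_one(x):
    return abs(sum(x) - 1.0) <= TOL

def in_orthant(v):
    return all(t >= -TOL for t in v)

def in_soc(v):
    """Second-order cone test: v[0] >= |(v[1], v[2])|."""
    return v[0] >= math.hypot(v[1], v[2]) - TOL

def is_psd_2x2(a, b, d):
    """PSD test for the symmetric matrix [[a, b], [b, d]]."""
    return a >= -TOL and d >= -TOL and a * d - b * b >= -TOL

def vector_slack(y):
    """Dual slack s = (2, 1, 1) - y * (1, 1, 1)."""
    return tuple(c - y for c in COST)

r2 = math.sqrt(2.0)

# LP instance.
x_lp, y_lp = (0.0, 1.0, 0.0), 1.0
lp_ok = sums_to_one(x_lp) and in_orthant(x_lp) and in_orthant(vector_slack(y_lp))

# SDP instance: X = [[x1, x2], [x2, x3]] and S = C - y * A1.
x_sdp, y_sdp = (0.0, 0.0, 1.0), 1.0
S = (2.0 - y_sdp, 0.5 - 0.5 * y_sdp, 1.0 - y_sdp)
sdp_ok = sums_to_one(x_sdp) and is_psd_2x2(*x_sdp) and is_psd_2x2(*S)

# SOCP instance: the symmetric candidate x2 = x3 on the boundary of the cone.
a = 1.0 - r2 / 2.0
x_soc, y_soc = (r2 - 1.0, a, a), r2
soc_ok = sums_to_one(x_soc) and in_soc(x_soc) and in_soc(vector_slack(y_soc))

gaps = {
    "LP": objective(x_lp) - y_lp,
    "SDP": objective(x_sdp) - y_sdp,
    "SOCP": objective(x_soc) - y_soc,
}
print(lp_ok, sdp_ok, soc_ok, gaps)
```

With these candidates the LP and SDP instances both have optimal value 1, while the second-order cone instance has optimal value $\sqrt{2}$; the same linear data give different optima under different cones.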

Let us consider a couple more dual examples of the problems we posed earlier.

Example 2 (The Dual of Binary Quadratic Maximization). Consider the semidefinite relaxation (6.3) for the binary quadratic maximization problem. Its dual is

$$\begin{array}{ll} \text{minimize} & \sum_{j=1}^{n+1} y_j \\ \text{subject to} & \sum_{j=1}^{n+1} y_j I_j - S = \begin{pmatrix} Q & c \\ c^T & 0 \end{pmatrix}, \quad S \succeq 0. \end{array}$$

Note that

$$\sum_{j=1}^{n+1} y_j I_j - \begin{pmatrix} Q & c \\ c^T & 0 \end{pmatrix}$$

is exactly the Hessian matrix of the Lagrange function of the quadratic maximization problem; see Chap. 11. Therefore, there is a close connection between the Lagrange and conic dualities. The problem is to find a diagonal matrix $\mathrm{Diag}[(y_1; \ldots; y_{n+1})]$ such that the Lagrange Hessian is positive semidefinite and its sum of diagonal elements is minimized.

Example 3 (The Dual of Sensor Localization). Consider the semidefinite programming relaxation for the sensor localization problem (with no noise). Its dual is

$$\begin{array}{ll} \text{maximize} & \sum_{(i,j) \in N_e} y_{ij} (d_{ij})^2 \\ \text{subject to} & \sum_{(i,j) \in N_e} y_{ij} (e_i - e_j)(e_i - e_j)^T + S = 0, \quad S \succeq 0. \end{array}$$

Here, $y_{ij}$ represents an internal force or tension on edge $(i, j)$. Obviously, $y_{ij} = 0$ for all $(i, j) \in N_e$ is a feasible solution for the dual. However, finding non-trivial internal forces is a fundamental problem in network and structure design, and the maximization of the dual would help to achieve the goal.

Many optimization problems can be directly cast in the CLD form.

Example 4 (Euclidean Facility Location). This problem is to determine the location of a facility serving $n$ clients placed in a Euclidean space, whose known locations are denoted by $a_j \in E^d$, $j = 1, \ldots, n$. The location of the facility would minimize the sum of the Euclidean distances from the facility to each of the clients. Let the location decision be vector $f \in E^d$. Then the problem is

$$\text{minimize} \quad \sum_{j=1}^{n} |f - a_j|.$$

The problem can be reformulated as

$$\begin{array}{ll} \text{minimize} & \sum_{j=1}^{n} \delta_j \\ \text{subject to} & s_j + f = a_j, \ \forall j = 1, \ldots, n, \\ & |s_j| \le \delta_j, \ \forall j = 1, \ldots, n. \end{array}$$

This is a conic formulation in the (CLD) form. To see it clearly, let $d = 2$ and $n = 3$ in the example, and let

$$A^T = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 0 & -1 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 0 & -1 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 0 & -1 \end{pmatrix} \in E^{9 \times 5},$$

with $b = (1; 1; 1; 0; 0) \in E^5$ holding the objective coefficients, $c \in E^9$ assembled from zeros and the client locations $a_1, a_2, a_3$, and variable vector

$$y = [\delta_1; \delta_2; \delta_3; f] \in E^5.$$

Then, the facility location problem becomes

$$\begin{array}{ll} \text{minimize} & y^T b \\ \text{subject to} & y^T A + s^T = c^T, \quad s \in K; \end{array}$$

where $K$ is the product of three second-order cones each of which has dimension 3. More precisely, the first three elements of $s \in E^9$ are in the 3-dimensional second-order cone; and so are the second three elements and the third three elements of $s$. In general, the product of (possibly mixed) cones, say $K_1$, $K_2$ and $K_3$, is denoted by $K_1 \oplus K_2 \oplus K_3$, and $X \in K_1 \oplus K_2 \oplus K_3$ means that $X$ is divided into three components such that

$$X = (X_1; X_2; X_3), \ \text{where } X_1 \in K_1, \ X_2 \in K_2, \ \text{and } X_3 \in K_3.$$

The dual of the facility location problem would be in the (CLP) form:

$$\begin{array}{ll} \text{minimize} & c^T x \\ \text{subject to} & Ax = b, \quad x \in K^*; \end{array}$$

where

$$K^* = (K_1 \oplus K_2 \oplus K_3)^* = K_1^* \oplus K_2^* \oplus K_3^*.$$

That is, in this particular problem, the first three elements of $x \in E^9$ are in the 3-dimensional second-order cone; and so are the second three elements and the third three elements of $x$.

Considering further the equality constraints, the dual can be simplified as

$$\begin{array}{ll} \text{maximize} & \sum_{j=1}^{3} a_j^T x_j \\ \text{subject to} & \sum_{j=1}^{3} x_j = 0 \in E^2, \\ & |x_j| \le 1, \ \forall j = 1, 2, 3. \end{array}$$
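The simplified dual has a mechanical reading: each $x_j$ is a force of magnitude at most 1 pulling toward client $j$, and the forces must balance. The sketch below checks weak duality and a zero gap on a made-up instance, three clients at the corners of an equilateral triangle with the facility at the centroid. It assumes the simplified dual is: maximize $\sum_j a_j^T x_j$ subject to $\sum_j x_j = 0$ and $|x_j| \le 1$.

```python
import math

# Hypothetical client locations: corners of an equilateral triangle with side 2.
clients = [(0.0, 0.0), (2.0, 0.0), (1.0, math.sqrt(3.0))]

def primal_value(f):
    """Sum of Euclidean distances from facility f to the clients."""
    return sum(math.hypot(f[0] - a[0], f[1] - a[1]) for a in clients)

def dual_feasible(xs, tol=1e-9):
    balanced = all(abs(sum(x[k] for x in xs)) <= tol for k in range(2))
    bounded = all(math.hypot(x[0], x[1]) <= 1.0 + tol for x in xs)
    return balanced and bounded

def dual_value(xs):
    return sum(a[0] * x[0] + a[1] * x[1] for a, x in zip(clients, xs))

# Candidate primal solution: the centroid of the triangle.
f = (sum(a[0] for a in clients) / 3.0, sum(a[1] for a in clients) / 3.0)

# Candidate dual solution: unit vectors pointing from the facility to each client.
xs = []
for a in clients:
    dist = math.hypot(a[0] - f[0], a[1] - f[1])
    xs.append(((a[0] - f[0]) / dist, (a[1] - f[1]) / dist))

print(dual_feasible(xs), primal_value(f), dual_value(xs))

# Weak duality at a non-optimal facility location, here the first client.
print(primal_value(clients[0]) >= dual_value(xs))
```

Both objective values equal $2\sqrt{3}$ here, so the centroid is optimal for this symmetric instance.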

Example 5 (Quadratic Constraints). Quadratic constraints can be transformed to linear semidefinite form by using the concept of Schur complements. Let $A$ be a (symmetric) $m$-dimensional positive definite matrix, $C$ be a symmetric $n$-dimensional matrix, and $B$ be an $m \times n$ matrix. Then, matrix

$$S = C - B^T A^{-1} B$$

is called the Schur complement of $A$ in the matrix

$$Z = \begin{pmatrix} A & B \\ B^T & C \end{pmatrix}.$$

Moreover, $Z$ is positive semidefinite if and only if $S$ is positive semidefinite.

Now consider a general quadratic constraint of the form

$$y^T B^T B y - c^T y - d \le 0. \qquad (6.7)$$

This is equivalent to

$$\begin{pmatrix} I & By \\ y^T B^T & c^T y + d \end{pmatrix} \succeq 0 \qquad (6.8)$$

because the Schur complement of this matrix with respect to $I$ is the negative of the left side of the original constraint (6.7). Note that in this larger matrix, the variable $y$ appears only affinely, not quadratically.

Indeed, (6.8) can be written as

$$P(y) = P_0 + y_1 P_1 + y_2 P_2 + \cdots + y_n P_n \succeq 0, \qquad (6.9)$$

where

$$P_0 = \begin{pmatrix} I & 0 \\ 0 & d \end{pmatrix}, \quad P_i = \begin{pmatrix} 0 & b_i \\ b_i^T & c_i \end{pmatrix} \quad \text{for } i = 1, 2, \ldots, n,$$

with $b_i$ being the $i$th column of $B$ and $c_i$ being the $i$th component of $c$. The constraint (6.9) is of the form that appears in the dual form of a semidefinite program.
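The equivalence between the quadratic constraint and the linear matrix inequality can be tested exhaustively on a small integer grid. The sketch below uses made-up integer data `B`, `c`, `d`, assembles $P(y) = P_0 + \sum_i y_i P_i$ with $P_0$ holding the identity block and $d$, and $P_i$ holding the $i$th column of $B$ on the border and $c_i$ in the corner, and tests positive semidefiniteness through principal minors. The minor test is exact on integers but exponential in the dimension, so it is only suitable for tiny matrices.

```python
from itertools import combinations, product

def det(M):
    """Determinant by Laplace expansion along the first row (exact on integers)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j, pivot in enumerate(M[0]):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * pivot * det(minor)
    return total

def is_psd(M):
    """A symmetric matrix is PSD iff every principal minor is nonnegative."""
    size = len(M)
    for r in range(1, size + 1):
        for idx in combinations(range(size), r):
            if det([[M[i][j] for j in idx] for i in idx]) < 0:
                return False
    return True

# Hypothetical integer data for y^T B^T B y - c^T y - d <= 0.
B = [[1, 0],
     [1, 1]]
c = [1, 2]
d = 1
m, n = len(B), len(B[0])

def P0():
    M = [[1 if i == j else 0 for j in range(m + 1)] for i in range(m + 1)]
    M[m][m] = d
    return M

def Pi(i):
    M = [[0] * (m + 1) for _ in range(m + 1)]
    for r in range(m):
        M[r][m] = M[m][r] = B[r][i]  # i-th column of B on the border
    M[m][m] = c[i]
    return M

def P(y):
    """Affine matrix function P0 + sum_i y_i * Pi."""
    total = P0()
    for i in range(n):
        term = Pi(i)
        for r in range(m + 1):
            for s in range(m + 1):
                total[r][s] += y[i] * term[r][s]
    return total

def constraint_lhs(y):
    By = [sum(B[r][i] * y[i] for i in range(n)) for r in range(m)]
    return sum(t * t for t in By) - sum(c[i] * y[i] for i in range(n)) - d

grid = list(product(range(-3, 4), repeat=n))
mismatches = sum((constraint_lhs(y) <= 0) != is_psd(P(y)) for y in grid)
feasible = sum(constraint_lhs(y) <= 0 for y in grid)
print(mismatches, feasible, len(grid))
```

No mismatches occur on the grid, and both feasible and infeasible points are present, so the check is not vacuous.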
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There  is  a  more  efficient  mixed  semidefinite  and  second-order  cone  formulation 
of the inequality (6.7) to reduce the dimension of the semidefinite cone. We first introduce slack variables $s$ and $s_0$ by linear constraints:

$$By - s = 0.$$

Then, we let $|s| \le s_0$ (or $(s_0; s)$ in the second-order cone) and

$$\begin{pmatrix} 1 & s_0 \\ s_0 & c^T y + d \end{pmatrix} \succeq 0.$$

Again, the matrix constraint is of the dual form of a semidefinite cone, but its dimension is fixed at 2.

Suppose the original optimization problem has a quadratic objective: minimize $q(x)$. The objective can be written instead as: minimize $t$ subject to $q(x) \le t$, and then this constraint as well as any number of other quadratic constraints can be transformed to semidefinite constraints, and hence the entire problem converted to a mixed second-order cone and semidefinite program. This approach is useful in many applications, especially in various problems of financial engineering and control theory.


The duality is manifested by the relation between the optimal values of the primal and dual programs. The weak form of this relation is spelled out in the following lemma, the proof of which, like the weak form of other duality relations we have studied, is essentially an accounting issue.

Weak Duality in CLP. Let $X$ be feasible for (CLP) and $(y, S)$ feasible for (CLD). Then,

$$C \bullet X \ge y^T b.$$

Proof. By direct calculation

$$\begin{array}{rl} C \bullet X - y^T b & = \left( \sum_{i=1}^{m} y_i A_i + S \right) \bullet X - y^T b \\ & = \sum_{i=1}^{m} y_i (A_i \bullet X) + S \bullet X - y^T b \\ & = \sum_{i=1}^{m} y_i b_i + S \bullet X - y^T b \\ & = S \bullet X \ge 0, \end{array}$$

where the last inequality comes from $X \in K$ and $S \in K^*$. ∎

As in other instances of duality, strong duality for conic linear programming may fail unless other conditions hold. For example, the duality gap may not be zero at optimality in the following SDP instance.

Example 6. The following semidefinite program has a duality gap:

$$C = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad A_1 = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad A_2 = \begin{pmatrix} 0 & -1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 2 \end{pmatrix}, \quad \text{and} \quad b = \begin{pmatrix} 0 \\ 2 \end{pmatrix}.$$

The primal minimal objective value is 0, achieved by

$$X = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix},$$

and the dual maximal objective value is $-2$, achieved by $y = [0, -1]$; so the duality gap is 2.
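The stated primal and dual solutions of this example can be verified directly. The sketch below assumes the data are the three $3 \times 3$ matrices $C$, $A_1$, $A_2$ as laid out in the example and the right-hand side $b = (0; 2)$; the value of $b$ is inferred here from the stated solutions and should be treated as an assumption. It checks feasibility of both points and confirms that the gap equals $S \bullet X$, as the weak duality calculation predicts. It does not prove optimality by itself: that follows because $X_{22} = 0$ forces $X_{12} = 0$ in any feasible $X$, and $S_{11} = 0$ forces $y_2 = -1$ in any feasible dual point.

```python
from itertools import combinations

def det(M):
    """Determinant by Laplace expansion along the first row (exact on integers)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j, pivot in enumerate(M[0]):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * pivot * det(minor)
    return total

def is_psd(M):
    """A symmetric matrix is PSD iff every principal minor is nonnegative."""
    size = len(M)
    return all(
        det([[M[i][j] for j in idx] for i in idx]) >= 0
        for r in range(1, size + 1)
        for idx in combinations(range(size), r)
    )

def inner(A, B):
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

C = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
A1 = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
A2 = [[0, -1, 0], [-1, 0, 0], [0, 0, 2]]
b = [0, 2]  # assumed right-hand side

X = [[0, 0, 0], [0, 0, 0], [0, 0, 1]]  # stated primal solution
y = [0, -1]                            # stated dual solution

# Dual slack S = C - y1*A1 - y2*A2.
S = [[C[i][j] - y[0] * A1[i][j] - y[1] * A2[i][j] for j in range(3)] for i in range(3)]

primal_feasible = is_psd(X) and [inner(A1, X), inner(A2, X)] == b
dual_feasible = is_psd(S)
primal_value = inner(C, X)
dual_value = b[0] * y[0] + b[1] * y[1]
gap = primal_value - dual_value

print(primal_feasible, dual_feasible, primal_value, dual_value, gap, inner(S, X))
```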

However, under certain technical conditions, there is no duality gap. One condition is related to whether or not the primal feasible region $\mathcal{F}_p$ or the dual feasible region $\mathcal{F}_d$ has an interior feasible solution. We say $\mathcal{F}_p$ has an interior (feasible solution) if and only if

\[ \mathring{\mathcal{F}}_p := \{ X : \mathcal{A}X = b,\ X \in \mathring{K} \} \]

is non-empty, and $\mathcal{F}_d$ has an interior feasible solution if and only if

\[ \mathring{\mathcal{F}}_d := \{ (y, S) : \mathcal{A}^T y + S = C,\ S \in \mathring{K}^* \} \]

is non-empty. We state here a version of the strong duality theorem.

Strong Duality in (CLP).

i) Let (CLP) or (CLD) be infeasible, and furthermore let the other be feasible and have an interior. Then the other is unbounded.

ii) Let (CLP) and (CLD) both be feasible, and furthermore let one of them have an interior. Then there is no duality gap between (CLP) and (CLD).

iii) Let (CLP) and (CLD) both be feasible and have interiors. Then both have optimal solutions with no duality gap.

Proof. We let the cone $H = K \oplus E^1_+$ in the following proof.

i) Suppose $\mathcal{F}_d$ is empty and $\mathcal{F}_p$ is feasible and has an interior feasible solution. Then we have an $\bar{X} \in \mathring{K}$ and $\bar{\tau} = 1$ that form an interior feasible solution to the (homogeneous) conic system:

\[ \mathcal{A}\bar{X} - b\bar{\tau} = 0, \quad (\bar{X}, \bar{\tau}) \in \mathring{H}. \]

Now, for any $z^*$, we form an alternative system pair based on Farkas' Lemma (Theorem 2):

\[ \{ (X, \tau) : \mathcal{A}X - b\tau = 0,\ C \bullet X - z^*\tau < 0,\ (X, \tau) \in H \}, \]

and

\[ \{ (y, S, \kappa) : \mathcal{A}^T y + S = C,\ -b^T y + \kappa = -z^*,\ (S, \kappa) \in H^* \}. \]

But the latter is infeasible, so the former has a feasible solution $(X, \tau)$. At such a feasible solution, if $\tau > 0$, we have $C \bullet (X/\tau) < z^*$ for any $z^*$. Otherwise, $\tau = 0$ implies that a new solution $\bar{X} + \alpha X$ is feasible for (CLP) for any positive $\alpha$; and, as $\alpha \to \infty$, the objective value of the new solution goes to $-\infty$. Hence, either way we have a feasible solution for (CLP) whose objective value is unbounded from below.

ii) Let $\mathcal{F}_p$ be feasible and have an interior feasible solution, and let $z^*$ be its objective infimum. Again, we have an alternative system pair as listed in the proof of i). But now the former is infeasible, so that we have a solution for the latter. From the Weak Duality theorem $b^T y \le z^*$, thus we must have $\kappa = 0$, that is, we have a solution $(y, S)$ such that

\[ \mathcal{A}^T y + S = C, \quad b^T y = z^*, \quad S \in K^*. \]

iii) We only need to prove that there exists a solution $X \in \mathcal{F}_p$ such that $C \bullet X = z^*$, that is, the infimum of (CLP) is attainable. But this is just the other side of the proof, given that $\mathcal{F}_d$ is feasible and has an interior feasible solution, and $z^*$ is also the supremum of (CLD). ∎


Again, if one of (CLP) and (CLD) has no interior feasible solution, the common objective value may not be attainable. For example,

\[
C = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \quad
A_1 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad \text{and} \quad b_1 = 2.
\]

The dual is feasible but has no interior, while the primal has an interior. The common objective value equals 0, but no primal solution attains the infimum value.
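To see why the infimum is not attained, note that the constraint $A_1 \bullet X = 2$ fixes $X_{12} = 1$, so positive semidefiniteness requires $X_{11} X_{22} \ge 1$. The sketch below uses an illustrative family $X(t)$ (an assumption of this example, not taken from the text) of interior feasible points whose objective $C \bullet X = X_{11}$ tends to 0, while $X_{11} = 0$ itself is infeasible.

```python
def primal_point(t):
    """Interior feasible X(t) = [[2/t, 1], [1, t]] for t > 0; A1 • X = 2 * X12 = 2."""
    return [[2.0 / t, 1.0], [1.0, t]]

def is_psd_2x2(X, tol=1e-12):
    """A symmetric 2 x 2 matrix is PSD iff its trace and determinant are nonnegative."""
    trace = X[0][0] + X[1][1]
    det = X[0][0] * X[1][1] - X[0][1] * X[1][0]
    return trace >= -tol and det >= -tol

objective_values = []
for t in (1.0, 10.0, 100.0, 1000.0):
    X = primal_point(t)
    assert is_psd_2x2(X)                 # every X(t) is feasible
    objective_values.append(X[0][0])     # C • X = X11 = 2 / t

print(objective_values)

# With X11 = 0 and X12 = 1 the determinant is -1, so no feasible X reaches value 0.
print(is_psd_2x2([[0.0, 1.0], [1.0, 5.0]]))
```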

Most of the examples in which strong duality fails are superficial, and a small perturbation would overcome the failure. Thus, in real applications and in the rest of the chapter, we may assume that both (CLP) and (CLD) have interior when they are feasible. Consequently, any primal and dual optimal solution pair must satisfy the optimality conditions:

\[
\begin{aligned}
C \bullet X - y^T b &= 0 \\
\mathcal{A}X &= b \\
\mathcal{A}^T y + S &= C \\
X \in K,\ S &\in K^*
\end{aligned}
\tag{6.10}
\]

or

\[
\begin{aligned}
X \bullet S &= 0 \\
\mathcal{A}X &= b \\
\mathcal{A}^T y + S &= C \\
X \in K,\ S &\in K^*.
\end{aligned}
\tag{6.11}
\]


We  now  present  an  application  of  the  strong  duality  theorem. 

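Example 7 (Robust Portfolio Design). The Markowitz portfolio design model (also see 5) is

\[
\begin{array}{ll}
\text{minimize} & x^T \Sigma x \\
\text{subject to} & \mathbf{1}^T x = 1, \quad \boldsymbol{\pi}^T x \ge \pi,
\end{array}
\]

where $\Sigma$ is the covariance matrix and $\boldsymbol{\pi}$ is the expected return rate vector of a set of stocks, and $\pi$ is the desired return rate of the portfolio. The problem can be equivalently written as a mixed conic problem

\[
\begin{array}{ll}
\text{minimize} & \Sigma \bullet X \\
\text{subject to} & \mathbf{1}^T x = 1, \quad \boldsymbol{\pi}^T x \ge \pi, \\
& X - x x^T \succeq 0.
\end{array}
\]

Now suppose $\Sigma$ is incomplete and/or uncertain, and it is expressed by

\[ \Sigma_0 + \sum_{i=1}^m y_i \Sigma_i \]

for some variables $y_i$. Then, we would like to solve a robust model

\[
\begin{array}{ll}
\text{minimize} & \left\{ \begin{array}{ll} \max_y & \big( \Sigma_0 + \sum_{i=1}^m y_i \Sigma_i \big) \bullet X \\ \text{s.t.} & \Sigma_0 + \sum_{i=1}^m y_i \Sigma_i \succeq 0 \end{array} \right\} \\
\text{subject to} & \mathbf{1}^T x = 1, \quad \boldsymbol{\pi}^T x \ge \pi, \\
& X - x x^T \succeq 0.
\end{array}
\]

The inner problem is an SDP problem. Assuming strong duality holds, we replace it by its dual, with a multiplier matrix $Y \succeq 0$ for the semidefinite constraint, and have

\[
\begin{array}{ll}
\text{minimize} & \left\{ \begin{array}{ll} \min_Y & \Sigma_0 \bullet (Y + X) \\ \text{s.t.} & \Sigma_i \bullet (Y + X) = 0, \ \forall i = 1, \ldots, m, \\ & Y \succeq 0 \end{array} \right\} \\
\text{subject to} & \mathbf{1}^T x = 1, \quad \boldsymbol{\pi}^T x \ge \pi, \\
& X - x x^T \succeq 0.
\end{array}
\]

Then, we can integrate the two minimization problems together and form

\[
\begin{array}{ll}
\text{minimize} & \Sigma_0 \bullet (Y + X) \\
\text{subject to} & \mathbf{1}^T x = 1, \quad \boldsymbol{\pi}^T x \ge \pi, \\
& \Sigma_i \bullet (Y + X) = 0, \ \forall i = 1, \ldots, m, \\
& Y \succeq 0, \quad X - x x^T \succeq 0.
\end{array}
\]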

6.5  Complementarity  and  Solution  Rank  of  SDP 

In linear programming, since $x \ge 0$ and $s \ge 0$,

\[ 0 = x \bullet s = x^T s = \sum_{j=1}^n x_j s_j \]

implies that $x_j s_j = 0$ for all $j = 1, \ldots, n$. This property is often called complementarity. Thus, besides feasibility, an optimal linear programming solution pair must satisfy complementarity.

Now consider the semidefinite cone $S^n_+$. Since $X \succeq 0$ and $S \succeq 0$, $0 = X \bullet S$ implies $XS = 0$, that is, the regular matrix product of the two is a zero matrix. In other words, every column (or row) of $X$ is orthogonal to every column (or row) of $S$. We also call this property complementarity. Thus, besides feasibility, an optimal semidefinite programming solution pair must satisfy complementarity.
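One way to see the implication is the identity $X \bullet S = \|X^{1/2} S^{1/2}\|_F^2$, valid for positive semidefinite $X$ and $S$: a zero inner product forces $X^{1/2} S^{1/2} = 0$ and hence $XS = 0$. The following sketch, assuming NumPy and using an illustrative complementary pair built on a random orthonormal basis, checks the identity and the resulting rank relation numerically. The helper `psd_sqrt` is a hypothetical name introduced here.

```python
import numpy as np

def psd_sqrt(M):
    """Symmetric square root of a PSD matrix via its eigendecomposition."""
    w, Q = np.linalg.eigh(M)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

rng = np.random.default_rng(0)
n, r = 5, 2

# Random orthonormal basis: X lives on the first r directions, S on the others.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
X = Q[:, :r] @ np.diag([2.0, 1.0]) @ Q[:, :r].T          # rank 2
S = Q[:, r:] @ np.diag([3.0, 1.0, 0.5]) @ Q[:, r:].T     # rank 3

inner_XS = float(np.sum(X * S))                                   # X • S
frob_sq = float(np.linalg.norm(psd_sqrt(X) @ psd_sqrt(S)) ** 2)   # squared Frobenius norm
max_entry_XS = float(np.max(np.abs(X @ S)))                       # largest entry of XS
rank_sum = int(np.linalg.matrix_rank(X, tol=1e-9)
               + np.linalg.matrix_rank(S, tol=1e-9))

print(inner_XS, frob_sq, max_entry_XS, rank_sum)
```

In this instance the two ranks add up to exactly $n$, the extreme case allowed by the rank bound that follows from complementarity.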

Proposition 1. Let $X^*$ and $(y^*, S^*)$ be any optimal SDP solution pair with zero duality gap. Then complementarity of $X^*$ and $S^*$ implies

\[ \mathrm{rank}(X^*) + \mathrm{rank}(S^*) \le n. \]

Furthermore, if there is an optimal (dual) $S^*$ such that $\mathrm{rank}(S^*) \ge d$, then the rank of any optimal (primal) $X^*$ is bounded above by $n - d$, where the integer $d$ satisfies $0 \le d \le n$; and the converse is also true.

In certain SDP problems, one may be interested in finding an optimal solution whose rank is minimal, while the interior-point algorithm for SDP (developed later) typically generates solutions whose rank is maximal for the primal and the dual, respectively. Thus, a rank reduction method is sometimes necessary to achieve this goal. For linear programming in the standard form, it is known that if there is an optimal solution, then there is an optimal basic solution $x^*$ with at most $m$ positive entries. Is there a similar structural fact for semidefinite programming? Indeed, we have

Proposition 2. If there is an optimal solution for SDP, then there is an optimal solution of SDP whose rank $r$ satisfies $r(r+1)/2 \le m$.

The proposition resembles the linear programming fundamental theorem of Carathéodory in Sect. 2.4. We now give a sketch of a similar constructive proof, as well as several other rank-reduction methods.
Null-Space Rank Reduction

Let $X^*$ be an optimal solution of SDP with rank $r$. If $r(r+1)/2 > m$, we orthonormally factorize $X^*$:

\[ X^* = (V^*)^T V^*, \quad V^* \in E^{r \times n}. \]

Then we consider a related SDP problem

\[
\begin{array}{lll}
\text{minimize} & V^* C (V^*)^T \bullet U & \\
\text{subject to} & V^* A_i (V^*)^T \bullet U = b_i, & i = 1, \ldots, m \\
& U \in S^r_+. &
\end{array}
\tag{6.12}
\]

Note that, for any feasible solution of (6.12), one can construct a feasible solution for the original SDP using

\[ X(U) = (V^*)^T U V^* \quad \text{and} \quad C \bullet X(U) = V^* C (V^*)^T \bullet U. \]

Thus, the minimal value of (6.12) is also $z^*$, and in particular $U = I$ (the identity matrix) is a minimizer of (6.12), since

\[ V^* C (V^*)^T \bullet I = C \bullet (V^*)^T V^* = C \bullet X^* = z^*. \]

Also, one can show that any feasible solution $U$ of (6.12) is its minimizer, so that $X(U)$ is a minimizer of the original SDP.

Consider the system of homogeneous linear equations:

\[ V^* A_i (V^*)^T \bullet W = 0, \quad i = 1, \ldots, m, \]

where $W \in S^r$ (i.e., an $r \times r$ symmetric matrix that need not be semidefinite). This system has $r(r+1)/2$ real variables and $m$ equations. Thus, as long as $r(r+1)/2 > m$, we must be able to find a symmetric matrix $W \ne 0$ that satisfies all $m$ equations. Without loss of generality, let $W$ be either indefinite or negative semidefinite (if it is positive semidefinite, we take $-W$ as $W$), that is, $W$ has at least one negative eigenvalue. Then we consider

\[ U(\alpha) = I + \alpha W. \]

Choose $\alpha^* > 0$ as the largest value for which $U(\alpha^*) \succeq 0$, so that $U(\alpha^*)$ has at least one zero eigenvalue (or $\mathrm{rank}\, U(\alpha^*) < r$). Note that

\[ V^* A_i (V^*)^T \bullet U(\alpha^*) = V^* A_i (V^*)^T \bullet (I + \alpha^* W) = V^* A_i (V^*)^T \bullet I = b_i, \quad i = 1, \ldots, m. \]

That is, $U(\alpha^*)$ is feasible and also optimal for (6.12). Thus, $X(U(\alpha^*))$ is a new minimizer for the original SDP, and its rank is strictly less than $r$. This process can be repeated until the system of homogeneous linear equations has only the all-zero solution, which requires $r(r+1)/2 \le m$. Such a solution rank reduction procedure is called the null-space reduction, and it is deterministic.
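The reduction loop can be sketched in a few lines. This is a minimal illustration assuming NumPy, with hypothetical names `sym_from_upper` and `reduce_rank`; it uses an eigenvalue factorization for $V^*$ and a singular value decomposition to find a null-space direction $W$. The sketch only preserves feasibility ($A_i \bullet X = b_i$ and $X \succeq 0$); that the objective is also preserved relies on the starting point being optimal, as argued above, and the random data here is not an optimal solution of any particular SDP.

```python
import numpy as np

def sym_from_upper(w, r):
    """Rebuild a symmetric r x r matrix from its upper-triangular entries."""
    W = np.zeros((r, r))
    W[np.triu_indices(r)] = w
    return W + np.triu(W, 1).T

def reduce_rank(X, A_list, tol=1e-9):
    """Lower rank(X) while keeping every A_i • X fixed and X positive semidefinite."""
    for _ in range(X.shape[0]):
        lam, Q = np.linalg.eigh(X)
        keep = lam > tol
        r = int(keep.sum())
        if r == 0:
            break
        V = np.sqrt(lam[keep])[:, None] * Q[:, keep].T    # X = V^T V, V is r x n
        iu = np.triu_indices(r)
        weight = np.where(iu[0] == iu[1], 1.0, 2.0)       # off-diagonal entries count twice
        G = np.array([weight * (V @ A @ V.T)[iu] for A in A_list])
        _, s, Vh = np.linalg.svd(G)
        if int((s > tol).sum()) == G.shape[1]:
            break                                         # only W = 0 solves the system
        W = sym_from_upper(Vh[-1], r)                     # nonzero null-space direction
        mu = np.linalg.eigvalsh(W)
        if -mu[0] < mu[-1]:                               # choose the sign with the most
            W, mu = -W, -mu[::-1]                         # negative eigenvalue dominant
        alpha = -1.0 / mu[0]                              # largest step with I + alpha W PSD
        X = V.T @ (np.eye(r) + alpha * W) @ V
    return X

rng = np.random.default_rng(1)
n, m = 6, 3
A_list = [(B + B.T) / 2 for B in rng.standard_normal((m, n, n))]
F = rng.standard_normal((n, n))
X0 = F @ F.T                                              # full-rank PSD starting point
b = np.array([np.sum(A * X0) for A in A_list])

X1 = reduce_rank(X0, A_list)
r1 = int(np.linalg.matrix_rank(X1, tol=1e-7))
residual = max(abs(np.sum(A * X1) - bi) for A, bi in zip(A_list, b))
print(r1, residual)
```

Choosing the sign of $W$ so that its most negative eigenvalue dominates is a small numerical safeguard added in this sketch; either sign is admissible as long as $W$ has a negative eigenvalue.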
To see an application of Proposition 2, consider a general quadratic minimization with sphere constraint

\[
\begin{array}{lll}
z^* = & \text{minimize} & x^T Q x + 2 c^T x \\
& \text{subject to} & |x|^2 = 1, \quad x \in E^n,
\end{array}
\]

where $Q$ is general. The problem has an SDP relaxation:

\[
\begin{array}{lll}
z^{SDP} = & \text{minimize} & \begin{pmatrix} Q & c \\ c^T & 0 \end{pmatrix} \bullet Y \\
& \text{subject to} & \begin{pmatrix} I & 0 \\ 0^T & 0 \end{pmatrix} \bullet Y = 1, \\
& & \begin{pmatrix} 0 & 0 \\ 0^T & 1 \end{pmatrix} \bullet Y = 1, \\
& & Y \in S^{n+1}_+.
\end{array}
\]

Note that the relaxation and its dual both have interior so that the strong duality theorem holds, and it must have a rank-1 optimal SDP solution because $m = 2$, so that Proposition 2 gives $r(r+1)/2 \le 2$, that is, $r \le 1$. But a rank-1 optimal SDP solution would be optimal to the original quadratic minimization with sphere constraint. Thus, we must have $z^* = z^{SDP}$.


Gaussian Projection Reduction

There is also a randomized procedure to produce an approximate SDP solution with a desired low rank $d$. Again, let $X^*$ be an optimal solution of SDP with rank $r > d$ and we factorize $X^*$ as

\[ X^* = (V^*)^T V^*, \quad V^* \in E^{r \times n}. \]

We then generate i.i.d. Gaussian random variables $\xi_i^j$ with mean 0 and variance $1/d$, $i = 1, \ldots, r$; $j = 1, \ldots, d$, and form random vectors $\xi^j = (\xi_1^j; \ldots; \xi_r^j)$, $j = 1, \ldots, d$. Finally, we let

\[ \hat{X} = (V^*)^T \Big( \sum_{j=1}^d \xi^j (\xi^j)^T \Big) V^*. \]

Note that the rank of $\hat{X}$ is $d$ and

\[ E(\hat{X}) = (V^*)^T I\, V^* = X^*. \]

One can further show that $\hat{X}$ would be a good rank-$d$ approximate SDP solution in many cases.
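A minimal sketch of the projection follows, assuming NumPy. The function name `gaussian_projection` is hypothetical, and a random positive semidefinite matrix of rank 4 stands in for an optimal $X^*$. The sketch checks that one sample has rank $d$ and that the sample mean over many draws is close to $X^*$.

```python
import numpy as np

def gaussian_projection(X_star, d, rng):
    """Return V^T (sum_j xi_j xi_j^T) V, a random PSD matrix of rank at most d."""
    lam, Q = np.linalg.eigh(X_star)
    keep = lam > 1e-9
    V = np.sqrt(lam[keep])[:, None] * Q[:, keep].T        # X* = V^T V, V is r x n
    r = V.shape[0]
    Xi = rng.normal(0.0, np.sqrt(1.0 / d), size=(r, d))   # columns are xi^j, variance 1/d
    return V.T @ (Xi @ Xi.T) @ V

rng = np.random.default_rng(0)
F = rng.standard_normal((4, 8))
X_star = F.T @ F                    # 8 x 8 PSD matrix of rank 4, a stand-in for X*

X_hat = gaussian_projection(X_star, d=2, rng=rng)
rank_hat = int(np.linalg.matrix_rank(X_hat, tol=1e-9))

# The sample mean over many independent draws should approach X*.
mean_hat = np.mean([gaussian_projection(X_star, 2, rng) for _ in range(5000)], axis=0)
rel_err = float(np.linalg.norm(mean_hat - X_star) / np.linalg.norm(X_star))
print(rank_hat, rel_err)
```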



Randomized Binary Reduction

As discussed in the binary QP optimization, we would like to produce a vector $\hat{x}$ where each entry is either 1 or $-1$. A procedure to achieve this is as follows. Let $X^*$ be any optimal solution of SDP and we factorize $X^*$ as

\[ X^* = (V^*)^T V^*, \quad V^* \in E^{n \times n}. \]

Then, we generate a random $n$-dimensional vector $\xi$ where each entry is an i.i.d. Gaussian random variable with mean 0 and variance 1. Then we let

\[ \hat{x} = \mathrm{sign}\big( (V^*)^T \xi \big), \]

where

\[ \mathrm{sign}(v) = \begin{cases} 1 & \text{if } v > 0 \\ -1 & \text{otherwise.} \end{cases} \]

It was proved by Sheppard [228]:

\[ E[\hat{x}_i \hat{x}_j] = \frac{2}{\pi} \arcsin(X^*_{ij}), \quad i, j = 1, 2, \ldots, n. \]

Obviously, each entry of $\hat{x}$ is either 1 or $-1$.

One can further show that $\hat{x}$ would be a good approximate solution to the original binary QP. Let us consider the (homogeneous) binary quadratic maximization problem

\[
\begin{array}{lll}
z^* := & \text{maximize} & x^T Q x \\
& \text{subject to} & x_j \in \{1, -1\}, \ \text{for all } j = 1, \ldots, n,
\end{array}
\]

where we assume $Q$ is positive semidefinite. Then, the SDP relaxation would be

\[
\begin{array}{lll}
z^{SDP} := & \text{maximize} & Q \bullet X \\
& \text{subject to} & I_j \bullet X = 1, \ \text{for all } j = 1, \ldots, n, \\
& & X \in S^n_+;
\end{array}
\]

and let $X^*$ be any optimal solution, from which we produce a random binary vector $\hat{x}$. Let us evaluate the expected objective value

\[ E(\hat{x}^T Q \hat{x}) = E(Q \bullet \hat{x}\hat{x}^T) = Q \bullet E(\hat{x}\hat{x}^T) = Q \bullet \frac{2}{\pi} \arcsin[X^*] = \frac{2}{\pi} \big( Q \bullet \arcsin[X^*] \big), \]

where $\arcsin[X^*] \in S^n$ is the matrix whose $(i, j)$th entry equals $\arcsin(X^*_{ij})$. One can further show

\[ \arcsin[X^*] - X^* \succeq 0 \]

so that (from $Q \succeq 0$)

\[ Q \bullet \arcsin[X^*] \ge Q \bullet X^* = z^{SDP} \ge z^*, \]

that is, the expected objective value of $\hat{x}$ is no less than a factor $2/\pi$ of the maximal value of the binary QP.

The randomized binary reduction can be extended to quadratic optimization with simple bound constraints such as $x_j^2 \le 1$.
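The rounding step and Sheppard's formula can be illustrated numerically. The sketch below assumes NumPy; the names `factor` and `random_binary` are hypothetical, and a fixed positive definite matrix with unit diagonal stands in for an optimal $X^*$ of the relaxation (no SDP is solved here). It compares the empirical second moments of the rounded vectors with $\frac{2}{\pi}\arcsin(X^*_{ij})$.

```python
import numpy as np

def factor(X):
    """Return V with V^T V = X for a PSD matrix X (eigenvalue factorization)."""
    lam, Q = np.linalg.eigh(X)
    return np.sqrt(np.clip(lam, 0.0, None))[:, None] * Q.T

def random_binary(V, rng):
    """One sample sign(V^T xi) with xi ~ N(0, I); sign(0) is taken as -1, as in the text."""
    xi = rng.standard_normal(V.shape[0])
    return np.where(V.T @ xi > 0, 1.0, -1.0)

rng = np.random.default_rng(0)

# A PSD matrix with unit diagonal, standing in for an optimal X* of the relaxation.
X_star = np.array([[1.0, 0.5, -0.2],
                   [0.5, 1.0, 0.3],
                   [-0.2, 0.3, 1.0]])
V = factor(X_star)

samples = np.array([random_binary(V, rng) for _ in range(20000)])
empirical = samples.T @ samples / len(samples)      # estimate of E[x_hat x_hat^T]
predicted = (2.0 / np.pi) * np.arcsin(X_star)       # Sheppard's formula
max_dev = float(np.max(np.abs(empirical - predicted)))
print(max_dev)
```

The unit diagonal matters: it makes each component of $(V^*)^T \xi$ a standard normal variable, which is the setting in which the arcsine formula applies.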


6.6  Interior-Point  Algorithms  for  Conic  Linear  Programming 

Since (CLP) is a convex minimization problem, many optimization algorithms are applicable for solving it. However, the most natural conic linear programming algorithm seems to be an extension of the interior-point linear programming algorithm described in Chap. 5. We now describe this extension.

To develop efficient interior-point algorithms, the key is to find a suitable barrier or potential function. There is a general theory on the selection of barrier functions for (CLP), depending on the convex cone involved. We present a few for the convex cones listed in Example 1.

Example 1. The following are barrier functions for each of the convex cones.

• The $n$-dimensional non-negative orthant $E^n_+$:

\[ B(x) = -\sum_{j=1}^n \log(x_j). \]

• The $n$-dimensional semidefinite cone $S^n_+$:

\[ B(X) = -\log(\det X). \]

• The $(n+1)$-dimensional second-order cone $\{(u; x) : u \ge |x|\}$:

\[ B(u; x) = -\log(u^2 - |x|^2). \]
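The three barrier functions translate directly into code. The following sketch assumes NumPy; the function names are illustrative, and returning `inf` outside the interior of the cone is a convention chosen for this example rather than something stated in the text.

```python
import numpy as np

def barrier_orthant(x):
    """B(x) = -sum_j log x_j on the interior of the non-negative orthant."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        return np.inf
    return float(-np.sum(np.log(x)))

def barrier_semidefinite(X):
    """B(X) = -log det X on the interior of the semidefinite cone."""
    if np.min(np.linalg.eigvalsh(X)) <= 0:
        return np.inf
    _, logdet = np.linalg.slogdet(X)
    return float(-logdet)

def barrier_second_order(u, x):
    """B(u; x) = -log(u^2 - |x|^2) on the interior of the second-order cone."""
    x = np.asarray(x, dtype=float)
    gap = u * u - float(x @ x)
    if u <= 0 or gap <= 0:
        return np.inf
    return float(-np.log(gap))

# Each barrier grows without bound as the argument approaches the boundary of its cone.
for eps in (1e-1, 1e-4, 1e-8):
    print(barrier_orthant([eps, 1.0]),
          barrier_semidefinite(np.diag([eps, 1.0])),
          barrier_second_order(1.0, [np.sqrt(1.0 - eps), 0.0]))
```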

In the rest of the section, we focus on solving (SDP). Similar to LP, we consider (SDP) with the barrier function added in the objective:

\[
\begin{array}{lll}
(SDPB) & \text{minimize} & C \bullet X - \mu \log \det(X) \\
& \text{subject to} & X \in \mathring{\mathcal{F}}_p,
\end{array}
\]

or (SDD) with the barrier function added in the objective:

\[
\begin{array}{lll}
(SDDB) & \text{maximize} & y^T b + \mu \log \det(S) \\
& \text{subject to} & (y, S) \in \mathring{\mathcal{F}}_d,
\end{array}
\]

where again $\mu > 0$ is called the barrier weight parameter. For a given $\mu$, the optimal solutions of (SDPB) and (SDDB) satisfy the conditions:

\[
\begin{aligned}
XS &= \mu I \\
\mathcal{A}X &= b \\
\mathcal{A}^T y + S &= C \\
X \succ 0,\ S &\succ 0.
\end{aligned}
\tag{6.13}
\]

Since

\[ \mu = \frac{\mathrm{trace}(XS)}{n} = \frac{X \bullet S}{n} = \frac{C \bullet X - y^T b}{n}, \]

$\mu$ equals the average complementarity or duality gap. These solutions, denoted by $(X(\mu), y(\mu), S(\mu))$, form the central path of SDP for $\mu \in (0, \infty)$. It is known that when $\mu \to 0$, $(X(\mu), y(\mu), S(\mu))$ tends to an optimal solution pair whose rank is maximal (Exercise 11).

We can also extend the primal-dual potential function from LP to SDP as a descent merit function:

\[ \psi_{n+\rho}(X, S) = (n + \rho) \log(X \bullet S) - \log(\det(X) \cdot \det(S)), \]

where $\rho \ge 0$. Note that if $X$ and $S$ are diagonal matrices, these definitions reduce to those for linear programming.
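As a small illustration, the potential function can be evaluated with log-determinants. The sketch assumes NumPy and positive definite inputs; the function name `potential` and the random test matrices are illustrative. It checks the reduction to the linear programming potential for diagonal matrices and the lower bound $n \log n$ for $\rho = 0$ that is stated in Exercise 9.

```python
import numpy as np

def potential(X, S, rho):
    """(n + rho) log(X • S) - log(det X * det S) for positive definite X and S."""
    n = X.shape[0]
    _, logdet_X = np.linalg.slogdet(X)
    _, logdet_S = np.linalg.slogdet(S)
    return float((n + rho) * np.log(np.sum(X * S)) - logdet_X - logdet_S)

rng = np.random.default_rng(0)
n = 4
F, G = rng.standard_normal((2, n, n))
X = F @ F.T + np.eye(n)            # random positive definite matrices
S = G @ G.T + np.eye(n)

# With rho = 0 the potential is bounded below by n log n (Exercise 9).
lower_bound_holds = potential(X, S, rho=0.0) >= n * np.log(n)

# For diagonal matrices the value agrees with the LP potential function.
x = np.array([1.0, 2.0, 0.5, 3.0])
s = np.array([0.2, 1.0, 4.0, 0.1])
lp_potential = (n + 2.0) * np.log(x @ s) - np.sum(np.log(x * s))
matches_lp = bool(np.isclose(potential(np.diag(x), np.diag(s), rho=2.0), lp_potential))
print(lower_bound_holds, matches_lp)
```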

Once we have an interior feasible point $(X, y, S)$, we can generate a new iterate $(X^+, y^+, S^+)$ by solving for $(D_X, d_y, D_S)$ from the primal-dual system of linear equations

\[
\begin{aligned}
D^{-1} D_X D^{-1} + D_S &= \frac{n}{n + \rho}\, \mu X^{-1} - S, \\
A_i \bullet D_X &= 0, \quad \text{for all } i, \\
\sum_{i=1}^m (d_y)_i A_i + D_S &= 0,
\end{aligned}
\tag{6.14}
\]

where $D$ is the (scaling) matrix

\[ D = X^{.5} (X^{.5} S X^{.5})^{-.5} X^{.5} \]

and $\mu = X \bullet S / n$. Then one assigns $X^+ = X + \alpha D_X$, $y^+ = y + \alpha d_y$, and $S^+ = S + \alpha D_S$ for a step size $\alpha > 0$. Furthermore, it can be shown that there exists a step size $\alpha = \bar{\alpha}$ such that

\[ \psi_{n+\rho}(X^+, S^+) - \psi_{n+\rho}(X, S) \le -\delta \]

for a constant $\delta > 0.2$.
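The scaling matrix can be formed from eigendecompositions. The sketch below assumes NumPy and positive definite $X$ and $S$; the helper names `sym_power` and `scaling_matrix` are hypothetical. It checks the property $D S D = X$, which follows from the formula for $D$ by direct substitution, and the diagonal case, where $D$ reduces to $\mathrm{diag}(\sqrt{x_j / s_j})$. It does not solve the full system (6.14).

```python
import numpy as np

def sym_power(M, p):
    """M^p for a symmetric positive definite matrix M via its eigendecomposition."""
    lam, Q = np.linalg.eigh(M)
    return (Q * lam ** p) @ Q.T

def scaling_matrix(X, S):
    """D = X^{1/2} (X^{1/2} S X^{1/2})^{-1/2} X^{1/2}."""
    Xh = sym_power(X, 0.5)
    return Xh @ sym_power(Xh @ S @ Xh, -0.5) @ Xh

rng = np.random.default_rng(0)
n = 4
F, G = rng.standard_normal((2, n, n))
X = F @ F.T + np.eye(n)            # random positive definite matrices
S = G @ G.T + np.eye(n)

D = scaling_matrix(X, S)
print(np.allclose(D @ S @ D, X))   # D maps S to X
print(np.allclose(D, D.T))         # D is symmetric

# Diagonal case: D reduces to diag(sqrt(x_j / s_j)), as in linear programming.
D_lp = scaling_matrix(np.diag([4.0, 1.0]), np.diag([1.0, 4.0]))
print(D_lp)
```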

We outline the algorithm here.

Step 1. Given an interior feasible point $(X^0, y^0, S^0)$, set $\rho \ge \sqrt{n}$ and $k := 0$.

Step 2. Set $(X, S) = (X^k, S^k)$ and compute $(D_X, d_y, D_S)$ from (6.14).

Step 3. Let $X^{k+1} = X^k + \bar{\alpha} D_X$, $y^{k+1} = y^k + \bar{\alpha} d_y$, and $S^{k+1} = S^k + \bar{\alpha} D_S$, where

\[ \bar{\alpha} = \arg\min_{\alpha \ge 0} \psi_{n+\rho}(X^k + \alpha D_X, S^k + \alpha D_S). \]

Step 4. Let $k := k + 1$. If $\dfrac{X^k \bullet S^k}{X^0 \bullet S^0} \le \epsilon$, stop. Otherwise return to Step 2.

Theorem 3. Let $\psi_{n+\rho}(X^0, S^0) \le \rho \log(X^0 \bullet S^0) + n \log n$. Then, the algorithm terminates in at most $O(\rho \log(n/\epsilon))$ iterations.


Initialization: The HSD Algorithm

The linear programming Homogeneous Self-Dual Algorithm is also extendable to conic linear programming. Consider the minimization problem

\[
\begin{array}{lrcl}
(HSDCLP) \quad \min & (n+1)\theta & & \\
\text{s.t.} & \mathcal{A}X - b\tau + \bar{b}\theta & = & 0, \\
& -\mathcal{A}^T y + C\tau - \bar{C}\theta & = & S \in K^*, \\
& b^T y - C \bullet X + \bar{z}\theta & = & \kappa \ge 0, \\
& -\bar{b}^T y + \bar{C} \bullet X - \bar{z}\tau & = & -(n+1), \\
& y \text{ free},\ X \in K,\ \tau \ge 0,\ \theta \text{ free}, & &
\end{array}
\]

where

\[ \bar{b} = b - \mathcal{A}X^0, \quad \bar{C} = C - S^0, \quad \bar{z} = C \bullet X^0 + 1. \]

Here $X^0$ and $S^0$ are any pair of points in the interior of $K$ and $K^*$ such that they form a central path point with $\mu = 1$. Note that $X^0$ and $S^0$ do not need to satisfy the other equality constraints, so that they can be easily identified. For example, $x^0 = s^0 = \mathbf{1}$ for the nonnegative orthant cone; $x^0 = s^0 = (1; 0)$ for the $p$-order cone; and $X^0 = S^0 = I$ for the semidefinite cone.

Let $\mathcal{F}$ be the set of all feasible points $(y, X \in K, \tau \ge 0, \theta, S \in K^*, \kappa \ge 0)$. Then $\mathring{\mathcal{F}}$ is the set of interior feasible points $(y, X \in \mathring{K}, \tau > 0, \theta, S \in \mathring{K}^*, \kappa > 0)$.

Theorem 4. Consider the conic optimization (HSDCLP).

i) (HSDCLP) is self-dual, that is, its dual has a form identical to that of (HSDCLP).

ii) (HSDCLP) has an optimal solution and its optimal solution set is bounded.

iii) (HSDCLP) has an interior feasible point

\[ y = 0, \quad X = X^0, \quad \tau = 1, \quad \theta = 1, \quad S = S^0, \quad \kappa = 1. \]

iv) For any feasible point $(y, X, \tau, \theta, S, \kappa) \in \mathcal{F}$

\[ S^0 \bullet X + X^0 \bullet S + \tau + \kappa - (n+1)\theta = (n+1), \]

and

\[ X \bullet S + \tau\kappa = (n+1)\theta. \]

v) The optimal objective value of (HSDCLP) is zero, that is, any optimal solution of (HSDCLP) has

\[ X^* \bullet S^* + \tau^*\kappa^* = (n+1)\theta^* = 0. \]
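Statements iii) and iv) can be checked on a small instance. The sketch below assumes NumPy and specializes to the nonnegative orthant (so the conic program is an LP with $x^0 = s^0 = \mathbf{1}$); the problem data is random and purely illustrative, and the LP it defines need not even be feasible, which is the point of the embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2, 4
A = rng.standard_normal((m, n))      # arbitrary LP data
b = rng.standard_normal(m)
c = rng.standard_normal(n)

x0 = np.ones(n)                      # interior of the orthant
s0 = np.ones(n)                      # x0_j * s0_j = 1 for every j, so mu = 1
b_bar = b - A @ x0
c_bar = c - s0
z_bar = c @ x0 + 1.0

# Candidate interior feasible point from statement iii).
y, x, tau, theta, s, kappa = np.zeros(m), x0, 1.0, 1.0, s0, 1.0

res1 = A @ x - b * tau + b_bar * theta                  # expected: 0
res2 = -A.T @ y + c * tau - c_bar * theta - s           # expected: 0
res3 = b @ y - c @ x + z_bar * theta - kappa            # expected: 0
res4 = -b_bar @ y + c_bar @ x - z_bar * tau + (n + 1)   # expected: 0

# The two identities of statement iv) at this point.
first_identity = s0 @ x + x0 @ s + tau + kappa - (n + 1) * theta   # expected: n + 1
second_identity = x @ s + tau * kappa - (n + 1) * theta            # expected: 0
print(res1, res2, res3, res4, first_identity, second_identity)
```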

Now we are ready to apply the interior-point algorithm, starting from an available initial interior feasible solution, to solve (HSDCLP). The question is: how is an optimal solution of (HSDCLP) related to optimal solutions of the original (CLP) and (CLD)? We present the next theorem, and leave its proof as an exercise.

Theorem 5. Let $(y^*, X^*, \tau^*, \theta^* = 0, S^*, \kappa^*)$ be a (maximal rank) optimal solution of (HSDCLP) (as it is typically computed by interior-point algorithms).

i) (CLP) and (CLD) have an optimal solution pair if and only if $\tau^* > 0$. In this case, $X^*/\tau^*$ is an optimal solution for (CLP) and $(y^*/\tau^*, S^*/\tau^*)$ is an optimal solution for (CLD).

ii) (CLP) or (CLD) has an infeasibility certificate if and only if $\kappa^* > 0$. In this case, $X^*/\kappa^*$ or $S^*/\kappa^*$ or both are certificates for proving infeasibility; see Farkas' lemma for CLP.

iii) For all other cases, $\tau^* = \kappa^* = 0$.


6.7  Summary 

A relatively new class of mathematical programming problems, conic linear programming (hereafter CLP), is a natural extension of linear programming, which is a central decision model in management science and operations research. In CLP, the unknown is a vector or matrix in a closed convex cone while its entries are also restricted by some linear equalities and/or inequalities.

One of these cones is the semidefinite cone, that is, the set of all symmetric positive semidefinite matrices in a given dimension. There is a variety of interesting and important practical problems that can be naturally cast in this form, because many problems that appear nonlinear (such as quadratic problems) become essentially linear in semidefinite form. We have described some of these applications and selected results in combinatorial optimization, robust optimization, and engineering sensor networks. We have also illustrated some analyses to show why CLP is an effective model to tackle these difficult optimization problems.

We  present  fundamental  theorems  underlying  conic  linear  programming.  These 
theorems  include  Farkas’  lemma,  weak  and  strong  dualities,  and  solution  rank  struc¬ 
ture.  We  show  the  common  features  and  differences  of  these  theorems  between  LP 
and  CLP. 

The  efficient  interior-point  algorithms  for  linear  programming  can  be  extended 
to  solving  these  problems  as  well.  We  describe  these  extensions  applied  to  gen¬ 
eral  conic  programming  problems.  These  algorithms  closely  parallel  those  for  linear 
programming.  There  is  again  a  central  path  and  potential  functions,  and  Newton’s 
method  is  a  good  way  to  follow  the  path  or  reduce  the  potential  function.  The  homo¬ 
geneous  and  self-dual  algorithm,  which  is  popularly  used  for  linear  programming, 
is  also  extended  to  CLP. 
6.8 Exercises

1. Prove that

i) The dual cone of $E^n_+$ is itself.

ii) The dual cone of $S^n_+$ is itself.

iii) The dual cone of the $p$-order cone is the $q$-order cone where $\frac{1}{p} + \frac{1}{q} = 1$ and $1 \le p \le \infty$.

2. Let both $K_1$ and $K_2$ be closed convex cones. Show

i) $(K_1^*)^* = K_1$.

ii) $K_1 \subset K_2 \Longrightarrow K_2^* \subset K_1^*$.

iii) $(K_1 \oplus K_2)^* = K_1^* \oplus K_2^*$.

iv) $(K_1 + K_2)^* = K_1^* \cap K_2^*$.

v) $(K_1 \cap K_2)^* = K_1^* + K_2^*$.

Note: by definition $S + T = \{s + t : s \in S,\ t \in T\}$.

3. Prove the following:

i) Theorem 1.

ii) Proposition 1.

iii) Let $X \in K$ and $Y \in K^*$. Then $X \bullet Y \ge 0$.

4.  Guess  an  optimal  solution  and  the  optimal  objective  value  of  each  instance  of 
Example  1. 

5.  Prove  the  second  statement  of  Theorem  2. 

6.  Verify  the  weak  duality  theorem  of  the  three  CLP  instances  in  Example  1  in 
Sect.  6.2  and  Example  1  in  Sect.  6.4. 

7. Consider the SDP relaxation of the sensor network localization problem with four sensors:

\[
\begin{array}{l}
(e_i - e_j)(e_i - e_j)^T \bullet X = 1, \quad \forall i < j = 1, 2, 3, 4, \\
X \in S^4_+,
\end{array}
\]

in which $m = 6$. Show that the SDP problem has a solution with rank 3, which reaches the bound of Proposition 2.

8. Let $A$ and $B$ be two symmetric and positive semidefinite matrices. Prove that $A \bullet B \ge 0$, and $A \bullet B = 0$ implies $AB = 0$.

9. Let $X$ and $S$ both be positive definite. Prove that

\[ n \log(X \bullet S) - \log(\det(X) \cdot \det(S)) \ge n \log n. \]

10. Consider an SDP and the potential level set

\[ \Psi(\delta) = \{ (X, y, S) \in \mathring{\mathcal{F}} : \psi_{n+\rho}(X, S) \le \delta \}. \]

Prove that

\[ \Psi(\delta^1) \subset \Psi(\delta^2) \quad \text{if } \delta^1 \le \delta^2, \]

and for every $\delta$, $\Psi(\delta)$ is bounded and its closure $\hat{\Psi}(\delta)$ has non-empty intersection with the SDP solution set.

11. Let both (SDP) and (SDD) have interior feasible points. Then for any $0 < \mu < \infty$, the central path point $(X(\mu), y(\mu), S(\mu))$ exists and is unique. Moreover,

i) the central path point $(X(\mu), y(\mu), S(\mu))$ is bounded where $0 < \mu \le \mu^0$ for any given $0 < \mu^0 < \infty$.

ii) For $0 < \mu' < \mu$,

\[ C \bullet X(\mu') < C \bullet X(\mu) \quad \text{and} \quad b^T y(\mu') > b^T y(\mu) \]

if $X(\mu) \ne X(\mu')$ and $y(\mu) \ne y(\mu')$.

iii) $(X(\mu), y(\mu), S(\mu))$ converges to an optimal solution pair for (SDP) and (SDD), and the rank of the limit of $X(\mu)$ is maximal among all optimal solutions of (SDP) and the rank of the limit of $S(\mu)$ is maximal among all optimal solutions of (SDD).

12. Prove the logarithmic approximation lemma for SDP. Let $D \in S^n$ and $|D|_\infty < 1$. Then,

\[ \mathrm{trace}(D) \ge \log \det(I + D) \ge \mathrm{trace}(D) - \frac{|D|^2}{2(1 - |D|_\infty)}. \]

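13. Let $V \in \mathring{S}^n_+$ and $\rho \ge \sqrt{n}$. Then,

\[ \frac{\left| V^{-1/2} - \frac{n + \rho}{I \bullet V} V^{1/2} \right|}{|V^{-1/2}|_\infty} \ge \sqrt{3/4}. \]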


14.  Prove  both  Theorems  4  and  5. 


References 

6.1 Most of the materials presented can be found in convex analysis texts, such as Rockafellar [219].

6.2 Semidefinite relaxations have appeared in relation to relaxations of discrete optimization problems. In Lovász and Schrijver [159], a "lifting" procedure is presented to obtain a problem in $\Re^{n^2}$; and then the problem is projected back to obtain tighter inequalities; see also Balas et al. [12]. Since then, there have been several remarkable results on SDP relaxations for combinatorial optimization. The binary QP, a generalized Max-Cut problem, was studied by Goemans and Williamson [G8] and Nesterov [189]. Other SDP relaxations can be found in the survey by Luo et al. [171] and references therein. More CLP applications can be found in Boyd et al. [B22], Vandenberghe and Boyd [V2], Lobo, Vandenberghe and Boyd [156], Lasserre [150], Parrilo [204], etc.

The sensor localization problem described here is due to Biswas and Ye [B17]. Note that we can view the Sensor Network Localization problem as a Graph Realization or Embedding problem in Euclidean spaces, see So and Ye [231] and references therein; and it is related to the Euclidean Distance Matrix Completion Problems, see Alfakih et al. [3] and Laurent [151].

6.3 Farkas' lemma for conic linear constraints is closely linked to convex analysis (e.g., Rockafellar [219]) and the CLP duality theorems commented on next.

6.4 The conic formulation of the Euclidean facility location problem is due to Xue and Ye [264]. For discussion of Schur complements see Boyd and Vandenberghe [B23]. Robust optimization models using SDP can be found in Ben-Tal and Nemirovski [26], Goldfarb and Iyengar [112], etc. The SDP duality theory was studied by Barvinok [16], Nesterov and Nemirovskii [N2], Ramana [214], Ramana et al. [215], etc. The SDP example with a duality gap was constructed by R. Freund (private communication).

6.5 Complementarity and rank. The exact rank theorem described here is due to Pataki [205]; also see Barvinok [15]. An analysis of the Gaussian projection was presented by So et al. [232], which can be seen as a generalization of the Johnson and Lindenstrauss theorem [137]. The expectation of the randomized binary reduction is due to Sheppard [228] in 1900, and it was extensively used in Goemans and Williamson [G8], Nesterov [189], Ye [265], and Bertsimas and Ye [31].

6.6 In interior-point algorithms, the search direction $(D_X, d_y, D_S)$ can be determined by Newton's method with three different scalings: primal, dual and primal-dual. A primal-scaling (potential reduction) algorithm for semidefinite programming is due to Alizadeh [A4, A3], where Yinyu Ye "suggested studying the primal-dual potential function for this problem" and "looking at symmetric preserving scalings of the form $X^{-1/2} X X^{-1/2}$", and to Nesterov and Nemirovskii [N2]. A dual-scaling algorithm was developed by Benson et al. [25], which exploits the sparse structure of the dual SDP. The primal-dual SDP algorithm described here is due to Nesterov and Todd [N3] and references therein.

Efficient  interior-point  algorithms  are  also  developed  for  optimization  over  the 
second-order  cone;  see  Nesterov  and  Nemirovskii  [N2]  and  Xue  and  Ye  [264]. 
These  algorithms  have  established  the  best  approximation  complexity  results 
for  certain  combinatorial  location  problems. 

The homogeneous and self-dual initialization model was originally developed by Ye, Todd and Mizuno for LP [Y2], and for SDP by de Klerk et al. [72], Luo et al. [L18], and Nesterov et al. [191], and it became the foundational algorithm implemented in Sturm [S11] and Andersen [6].


Part  II 

Unconstrained  Problems 


Chapter  7 

Basic  Properties  of  Solutions  and  Algorithms 


In this chapter we consider optimization problems of the form

\[
\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & x \in \Omega,
\end{array}
\tag{7.1}
\]

where $f$ is a real-valued function and $\Omega$, the feasible set, is a subset of $E^n$. Throughout most of the chapter attention is restricted to the case where $\Omega = E^n$, corresponding to the completely unconstrained case, but sometimes we consider cases where $\Omega$ is some particularly simple subset of $E^n$.

The  first  and  third  sections  of  the  chapter  characterize  the  first-  and  second-order 
conditions  that  must  hold  at  a  solution  point  of  (7.1).  These  conditions  are  simply 
extensions  to  En  of  the  well-known  derivative  conditions  for  a  function  of  a  single 
variable  that  hold  at  a  maximum  or  a  minimum  point.  The  fourth  and  fifth  sections 
of  the  chapter  introduce  the  important  classes  of  convex  and  concave  functions  that 
provide  zeroth-order  conditions  as  well  as  a  natural  formulation  for  a  global  theory 
of  optimization  and  provide  geometric  interpretations  of  the  derivative  conditions 
derived  in  the  first  two  sections. 

The  final  sections  of  the  chapter  are  devoted  to  basic  convergence  characteristics 
of  algorithms.  Although  this  material  is  not  exclusively  applicable  to  optimization 
problems  but  applies  to  general  iterative  algorithms  for  solving  other  problems  as 
well,  it  can  be  regarded  as  a  fundamental  prerequisite  for  a  modern  treatment  of 
optimization  techniques.  Two  essential  questions  are  addressed  concerning  itera¬ 
tive  algorithms.  The  first  question,  which  is  qualitative  in  nature,  is  whether  a  given 
algorithm  in  some  sense  yields,  at  least  in  the  limit,  a  solution  to  the  original  prob¬ 
lem.  This  question  is  treated  in  Sect.  7.6,  and  conditions  sufficient  to  guarantee 
appropriate  convergence  are  established.  The  second  question,  the  more  quantita¬ 
tive  one,  is  related  to  how  fast  the  algorithm  converges  to  a  solution.  This  question 
is  defined  more  precisely  in  Sect.  7.7.  Several  special  types  of  convergence,  which 
arise  frequently  in  the  development  of  algorithms  for  optimization,  are  explored. 


7.1  First-Order  Necessary  Conditions 

Perhaps the first question that arises in the study of the minimization problem (7.1)
is whether a solution exists. The main result that can be used to address this issue is
the theorem of Weierstrass, which states that if $f$ is continuous and $\Omega$ is compact, a
solution exists (see Appendix A.6). This is a valuable result that should be kept in
mind throughout our development; however, our primary concern is with characterizing
solution points and devising effective methods for finding them.

In an investigation of the general problem (7.1) we distinguish two kinds of
solution points: local minimum points, and global minimum points.

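Definition. A point $\mathbf{x}^* \in \Omega$ is said to be a relative minimum point or a local minimum point
of $f$ over $\Omega$ if there is an $\varepsilon > 0$ such that $f(\mathbf{x}) \geq f(\mathbf{x}^*)$ for all $\mathbf{x} \in \Omega$ within a distance $\varepsilon$ of
$\mathbf{x}^*$ (that is, $\mathbf{x} \in \Omega$ and $|\mathbf{x} - \mathbf{x}^*| < \varepsilon$). If $f(\mathbf{x}) > f(\mathbf{x}^*)$ for all $\mathbf{x} \in \Omega$, $\mathbf{x} \neq \mathbf{x}^*$, within a distance $\varepsilon$
of $\mathbf{x}^*$, then $\mathbf{x}^*$ is said to be a strict relative minimum point of $f$ over $\Omega$.

Definition. A point $\mathbf{x}^* \in \Omega$ is said to be a global minimum point of $f$ over $\Omega$ if $f(\mathbf{x}) \geq f(\mathbf{x}^*)$
for all $\mathbf{x} \in \Omega$. If $f(\mathbf{x}) > f(\mathbf{x}^*)$ for all $\mathbf{x} \in \Omega$, $\mathbf{x} \neq \mathbf{x}^*$, then $\mathbf{x}^*$ is said to be a strict global
minimum point of $f$ over $\Omega$.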

In formulating and attacking problem (7.1) we are, by definition, explicitly asking
for a global minimum point of $f$ over the set $\Omega$. Practical reality, however, both
from the theoretical and computational viewpoint, dictates that we must in many
circumstances be content with a relative minimum point. In deriving necessary
conditions based on the differential calculus, for instance, or when searching for the
minimum point by a convergent stepwise procedure, comparing the values of
nearby points is all that is possible, and attention focuses on relative minimum points.
Global conditions and global solutions can, as a rule, only be found if the problem
possesses certain convexity properties that essentially guarantee that any relative
minimum is a global minimum. Thus, in formulating and attacking problem (7.1)
we shall, by the dictates of practicality, usually consider, implicitly, that we are
asking for a relative minimum point. If appropriate conditions hold, this will also be
a global minimum point.


Feasible  Directions 

To derive necessary conditions satisfied by a relative minimum point $\mathbf{x}^*$, the basic
idea is to consider movement away from the point in some given direction. Along
any given direction the objective function can be regarded as a function of a single
variable, the parameter defining movement in this direction, and hence the ordinary
calculus of a single variable is applicable. Thus given $\mathbf{x} \in \Omega$ we are motivated to say
that a vector $\mathbf{d}$ is a feasible direction at $\mathbf{x}$ if there is an $\bar{\alpha} > 0$ such that $\mathbf{x} + \alpha\mathbf{d} \in \Omega$
for all $\alpha$, $0 \leq \alpha \leq \bar{\alpha}$. With this simple concept we can state some simple conditions
satisfied by relative minimum points.

Proposition 1 (First-Order Necessary Conditions). Let $\Omega$ be a subset of $E^n$ and let $f \in C^1$
be a function on $\Omega$. If $\mathbf{x}^*$ is a relative minimum point of $f$ over $\Omega$, then for any $\mathbf{d} \in E^n$ that
is a feasible direction at $\mathbf{x}^*$, we have $\nabla f(\mathbf{x}^*)\mathbf{d} \geq 0$.

Proof. For any $\alpha$, $0 \leq \alpha \leq \bar{\alpha}$, the point $\mathbf{x}(\alpha) = \mathbf{x}^* + \alpha\mathbf{d} \in \Omega$. For $0 \leq \alpha \leq \bar{\alpha}$ define
the function $g(\alpha) = f(\mathbf{x}(\alpha))$. Then $g$ has a relative minimum at $\alpha = 0$. A typical $g$ is
shown in Fig. 7.1. By the ordinary calculus we have

$$g(\alpha) - g(0) = g'(0)\alpha + o(\alpha), \tag{7.2}$$

where $o(\alpha)$ denotes terms that go to zero faster than $\alpha$ (see Appendix A). If $g'(0) < 0$
then, for sufficiently small values of $\alpha > 0$, the right side of (7.2) will be negative,
and hence $g(\alpha) - g(0) < 0$, which contradicts the minimal nature of $g(0)$. Thus
$g'(0) = \nabla f(\mathbf{x}^*)\mathbf{d} \geq 0$. ∎

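A very important special case is where $\mathbf{x}^*$ is in the interior of $\Omega$ (as would be
the case if $\Omega = E^n$). In this case there are feasible directions emanating in every
direction from $\mathbf{x}^*$, and hence $\nabla f(\mathbf{x}^*)\mathbf{d} \geq 0$ for all $\mathbf{d} \in E^n$. Applying this to both $\mathbf{d}$
and $-\mathbf{d}$ implies $\nabla f(\mathbf{x}^*) = \mathbf{0}$. We state this important result as a corollary.

Corollary (Unconstrained Case). Let $\Omega$ be a subset of $E^n$, and let $f \in C^1$ be a function on
$\Omega$. If $\mathbf{x}^*$ is a relative minimum point of $f$ over $\Omega$ and if $\mathbf{x}^*$ is an interior point of $\Omega$, then
$\nabla f(\mathbf{x}^*) = \mathbf{0}$.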

The necessary conditions in the pure unconstrained case lead to $n$ equations
(one for each component of $\nabla f$) in $n$ unknowns (the components of $\mathbf{x}^*$), which in
many cases can be solved to determine the solution. In practice, however, as
demonstrated in the following chapters, an optimization problem is solved directly without
explicitly attempting to solve the equations arising from the necessary conditions.
Nevertheless, these conditions form a foundation for the theory.


Fig. 7.1 Construction for proof


Example 1. Consider the problem

$$\text{minimize} \quad f(x_1, x_2) = x_1^2 - x_1 x_2 + x_2^2 - 3x_2.$$

There are no constraints, so $\Omega = E^2$. Setting the partial derivatives of $f$ equal to zero
yields the two equations

$$2x_1 - x_2 = 0$$
$$-x_1 + 2x_2 = 3.$$

These have the unique solution $x_1 = 1$, $x_2 = 2$, which is a global minimum point
of $f$.
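
As a quick numerical check of Example 1, the following minimal sketch solves the two stationarity equations and evaluates the gradient at the result. The helper names `grad_f` and `solve_2x2` are illustrative choices, not part of the text, and Cramer's rule is used only because the system is 2 by 2.

```python
def grad_f(x1, x2):
    """Gradient of f(x1, x2) = x1^2 - x1*x2 + x2^2 - 3*x2."""
    return (2.0 * x1 - x2, -x1 + 2.0 * x2 - 3.0)


def solve_2x2(a11, a12, a21, a22, r1, r2):
    """Solve a 2x2 linear system by Cramer's rule (assumes a nonzero determinant)."""
    det = a11 * a22 - a12 * a21
    return ((r1 * a22 - a12 * r2) / det, (a11 * r2 - r1 * a21) / det)


# Stationarity conditions: 2*x1 - x2 = 0 and -x1 + 2*x2 = 3.
x_star = solve_2x2(2.0, -1.0, -1.0, 2.0, 0.0, 3.0)
print(x_star, grad_f(*x_star))
```

The printed point agrees with the solution $x_1 = 1$, $x_2 = 2$ stated in the example, and the gradient vanishes there.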


Example 2. Consider the problem

$$\text{minimize} \quad f(x_1, x_2) = x_1^2 - x_1 + x_2 + x_1 x_2$$
$$\text{subject to} \quad x_1 \geq 0, \ x_2 \geq 0.$$

This problem has a global minimum at $x_1 = \frac{1}{2}$, $x_2 = 0$. At this point

$$\frac{\partial f}{\partial x_1} = 2x_1 - 1 + x_2 = 0$$

$$\frac{\partial f}{\partial x_2} = 1 + x_1 = \frac{3}{2}.$$

Thus, the partial derivatives do not both vanish at the solution, but since any
feasible direction must have an $x_2$ component greater than or equal to zero, we have
$\nabla f(\mathbf{x}^*)\mathbf{d} \geq 0$ for all $\mathbf{d} \in E^2$ such that $\mathbf{d}$ is a feasible direction at the point $(1/2, 0)$.
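
The boundary case in Example 2 can be probed numerically. The sketch below is only an illustration under stated assumptions: `is_feasible_direction` is a hypothetical helper that encodes the feasibility test for the particular set $x_1 \geq 0$, $x_2 \geq 0$, and the directions are sampled on a coarse grid rather than enumerated exhaustively.

```python
def grad_f(x1, x2):
    """Gradient of f(x1, x2) = x1^2 - x1 + x2 + x1*x2."""
    return (2.0 * x1 - 1.0 + x2, 1.0 + x1)


def is_feasible_direction(x, d):
    """For the set x1 >= 0, x2 >= 0: a component that sits at zero
    must not be pushed negative; interior components are unrestricted."""
    return all(xi > 0.0 or di >= 0.0 for xi, di in zip(x, d))


x_star = (0.5, 0.0)
g = grad_f(*x_star)

# Sample directions on a coarse grid and evaluate the directional derivative.
samples = [(d1 / 4.0, d2 / 4.0) for d1 in range(-4, 5) for d2 in range(-4, 5)]
feasible = [d for d in samples if is_feasible_direction(x_star, d)]
slopes = [g[0] * d[0] + g[1] * d[1] for d in feasible]
print(g, min(slopes))
```

The gradient is $(0, 3/2)$, and the smallest sampled directional derivative over feasible directions is zero, attained when the $x_2$ component of the direction is zero.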


7.2  Examples  of  Unconstrained  Problems 

Unconstrained optimization problems occur in a variety of contexts, but most
frequently when the problem formulation is simple. More complex formulations
often involve explicit functional constraints. However, many problems with
constraints are converted to unconstrained problems, for example by using barrier
functions, as in the analytic center problem for (dual) linear programs. We present a
few more examples here that should begin to indicate the wide scope to which the
theory applies.

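Example 1 (Logistic Regression). Recall the classification problem where we have
vectors $\mathbf{a}_i \in E^d$ for $i = 1, 2, \ldots, n_1$ in a class, and vectors $\mathbf{b}_j \in E^d$ for $j =
1, 2, \ldots, n_2$ not. Then we wish to find $\mathbf{y} \in E^d$ and a number $\beta$ such that

$$\frac{\exp(\mathbf{a}_i^T\mathbf{y} + \beta)}{1 + \exp(\mathbf{a}_i^T\mathbf{y} + \beta)}$$

is close to 1 for all $i$, and

$$\frac{\exp(\mathbf{b}_j^T\mathbf{y} + \beta)}{1 + \exp(\mathbf{b}_j^T\mathbf{y} + \beta)}$$

is close to 0 for all $j$. The problem can be cast as an unconstrained optimization
problem, called the max-likelihood,

$$\text{maximize}_{\mathbf{y},\beta} \quad \left(\prod_i \frac{\exp(\mathbf{a}_i^T\mathbf{y} + \beta)}{1 + \exp(\mathbf{a}_i^T\mathbf{y} + \beta)}\right) \left(\prod_j \left(1 - \frac{\exp(\mathbf{b}_j^T\mathbf{y} + \beta)}{1 + \exp(\mathbf{b}_j^T\mathbf{y} + \beta)}\right)\right),$$

which can equivalently, using a logarithmic transformation, be written as

$$\text{minimize}_{\mathbf{y},\beta} \quad \sum_i \log\left(1 + \exp(-\mathbf{a}_i^T\mathbf{y} - \beta)\right) + \sum_j \log\left(1 + \exp(\mathbf{b}_j^T\mathbf{y} + \beta)\right).$$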
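
To make the logarithmic form concrete, here is a minimal sketch that evaluates the objective for a small data set. The data points and the function names `log1p_exp` and `neg_log_likelihood` are assumptions made for illustration; the stable evaluation of $\log(1 + e^z)$ is a standard numerical precaution, not something the text prescribes.

```python
import math


def log1p_exp(z):
    """Numerically stable evaluation of log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def neg_log_likelihood(y, beta, A, B):
    """sum_i log(1 + exp(-a_i.y - beta)) + sum_j log(1 + exp(b_j.y + beta))."""
    in_class = sum(log1p_exp(-(dot(a, y) + beta)) for a in A)
    out_class = sum(log1p_exp(dot(b, y) + beta) for b in B)
    return in_class + out_class


A = [(2.0, 1.0), (1.5, 2.0)]       # hypothetical points in the class
B = [(-1.0, -0.5), (-2.0, -1.0)]   # hypothetical points outside the class
value_at_zero = neg_log_likelihood((0.0, 0.0), 0.0, A, B)
value_separating = neg_log_likelihood((1.0, 1.0), 0.0, A, B)
print(value_at_zero, value_separating)
```

At $\mathbf{y} = \mathbf{0}$, $\beta = 0$ every term equals $\log 2$, so the objective is $4 \log 2$; a direction that separates the two groups gives a smaller value.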

Example 2 (Utility Maximization). A common problem in economic theory is the
determination of the best way to combine various inputs in order to maximize a
utility function $f(x_1, x_2, \ldots, x_n)$ (in the monetary unit) of the amounts $x_i$ of the
inputs, $i = 1, 2, \ldots, n$. The unit prices of the inputs are $p_1, p_2, \ldots, p_n$. The
producer wishing to maximize profit must solve the problem

$$\text{maximize} \quad f(x_1, x_2, \ldots, x_n) - p_1 x_1 - p_2 x_2 \ldots - p_n x_n.$$

The first-order necessary conditions are that the partial derivatives with respect
to the $x_i$'s each vanish. This leads directly to the $n$ equations

$$\frac{\partial f}{\partial x_i}(x_1, x_2, \ldots, x_n) = p_i, \quad i = 1, 2, \ldots, n.$$

These equations can be interpreted as stating that, at the solution, the marginal value
due to a small increase in the $i$th input must be equal to the price $p_i$.


Example 3 (Parametric Estimation). A common use of optimization is for the
purpose of function approximation. Suppose, for example, that through an
experiment the value of a function $g$ is observed at $m$ points, $x_1, x_2, \ldots, x_m$. Thus, values
$g(x_1), g(x_2), \ldots, g(x_m)$ are known. We wish to approximate the function by a
polynomial

$$h(x) = a_n x^n + a_{n-1} x^{n-1} + \ldots + a_0$$

of degree $n$ (or less), where $n < m$. Corresponding to any choice of the approximating
polynomial, there will be a set of errors $\varepsilon_k = g(x_k) - h(x_k)$. We define the best
approximation as the polynomial that minimizes the sum of the squares of these
errors; that is, minimizes

$$\sum_{k=1}^{m} (\varepsilon_k)^2.$$

This in turn means that we minimize

$$f(\mathbf{a}) = \sum_{k=1}^{m} \left[ g(x_k) - \left( a_n x_k^n + a_{n-1} x_k^{n-1} + \ldots + a_0 \right) \right]^2$$

with respect to $\mathbf{a} = (a_0, a_1, \ldots, a_n)$ to find the best coefficients. This is a quadratic
expression in the coefficients $\mathbf{a}$. To find a compact representation for this objective
we define $q_{ij} = \sum_{k=1}^{m} (x_k)^{i+j}$, $b_j = \sum_{k=1}^{m} g(x_k)(x_k)^j$ and $c = \sum_{k=1}^{m} g(x_k)^2$. Then after a bit of
algebra it can be shown that

$$f(\mathbf{a}) = \mathbf{a}^T\mathbf{Q}\mathbf{a} - 2\mathbf{b}^T\mathbf{a} + c$$

where $\mathbf{Q} = [q_{ij}]$, $\mathbf{b} = (b_1, b_2, \ldots, b_{n+1})$.

The first-order necessary conditions state that the gradient of $f$ must vanish. This
leads directly to the system of $n + 1$ equations

$$\mathbf{Q}\mathbf{a} = \mathbf{b}.$$

These can be solved to determine $\mathbf{a}$.
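
The system $\mathbf{Q}\mathbf{a} = \mathbf{b}$ can be assembled and solved in a few lines. The following is a minimal sketch, assuming the exponents $i, j$ run from $0$ to $n$ so that the unknowns are ordered $(a_0, a_1, \ldots, a_n)$; the function names and the sample data are illustrative, and plain Gaussian elimination stands in for whatever linear solver one would use in practice.

```python
def normal_equations(xs, gs, n):
    """Build Q with q_ij = sum_k x_k^(i+j) and b with b_j = sum_k g_k * x_k^j,
    for i, j = 0..n, matching the coefficient order (a_0, ..., a_n)."""
    Q = [[sum(x ** (i + j) for x in xs) for j in range(n + 1)] for i in range(n + 1)]
    b = [sum(g * x ** j for x, g in zip(xs, gs)) for j in range(n + 1)]
    return Q, b


def solve(Q, b):
    """Gaussian elimination with partial pivoting on a copy of [Q | b]."""
    m = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(Q, b)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, m):
            factor = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= factor * M[col][c]
    a = [0.0] * m
    for r in range(m - 1, -1, -1):
        tail = sum(M[r][c] * a[c] for c in range(r + 1, m))
        a[r] = (M[r][m] - tail) / M[r][r]
    return a


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
gs = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]  # hypothetical exact quadratic data
Q, b = normal_equations(xs, gs, n=2)
coeffs = solve(Q, b)  # ordered (a_0, a_1, a_2)
print(coeffs)
```

Because the sample data come from an exact quadratic, the recovered coefficients should match $(1, 2, 3)$ up to rounding. For high degrees the matrix $\mathbf{Q}$ becomes badly conditioned, so this direct approach is best read as an illustration of the necessary conditions rather than a recommended fitting method.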

Example 4 (Selection Problem). It is often necessary to select an assortment of
factors to meet a given set of requirements. An example is the problem faced by an
electric utility when selecting its power-generating facilities. The level of power
that the company must supply varies by time of the day, by day of the week, and
by season. Its power-generating requirements are summarized by a curve, $h(x)$, as
shown in Fig. 7.2a, which shows the total hours in a year that a power level of at
least $x$ is required for each $x$. For convenience the curve is normalized so that the
upper limit is unity.

The power company may meet these requirements by installing generating
equipment, such as (1) nuclear or (2) coal-fired, or by purchasing power from a central
energy grid. Associated with type $i$ ($i = 1, 2$) of generating equipment is a yearly
unit capital cost $b_i$ and a unit operating cost $c_i$. The unit price of power purchased
from the grid is $c_3$.

Nuclear plants have a high capital cost and low operating cost, so they are used to
supply a base load. Coal-fired plants are used for the intermediate level, and power
is purchased directly only for peak demand periods. The requirements are satisfied
as shown in Fig. 7.2b, where $x_1$ and $x_2$ denote the capacities of the nuclear and
coal-fired plants, respectively. (For example, the nuclear power plant can be visualized
as consisting of $x_1/\Delta$ small generators of capacity $\Delta$, where $\Delta$ is small. The first
such generator is on for about $h(\Delta)$ hours, supplying $\Delta h(\Delta)$ units of energy; the
next supplies $\Delta h(2\Delta)$ units, and so forth. The total energy supplied by the nuclear
plant is thus the area shown.)

Fig. 7.2 (a) Power requirement curve (hours required versus power level); (b) $x_1$ and $x_2$ denote the capacities of the nuclear and coal-fired plants, respectively

The total cost is

$$f(x_1, x_2) = b_1 x_1 + b_2 x_2 + c_1 \int_0^{x_1} h(x)\,dx + c_2 \int_{x_1}^{x_1 + x_2} h(x)\,dx + c_3 \int_{x_1 + x_2}^{1} h(x)\,dx,$$

and the company wishes to minimize this over the set defined by

$$x_1 \geq 0, \quad x_2 \geq 0, \quad x_1 + x_2 \leq 1.$$

Assuming that the solution is interior to the constraints, by setting the partial
derivatives equal to zero, we obtain the two equations

$$b_1 + (c_1 - c_2)h(x_1) + (c_2 - c_3)h(x_1 + x_2) = 0$$

$$b_2 + (c_2 - c_3)h(x_1 + x_2) = 0,$$

which represent the necessary conditions.

If $x_1 = 0$, then the general necessary condition theorem shows that the first
equality could relax to $\geq 0$. Likewise, if $x_2 = 0$, then the second equality could relax to
$\geq 0$. The case $x_1 + x_2 = 1$ requires a bit more analysis (see Exercise 2).
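
The two interior conditions can be solved one unknown at a time: the second fixes $h(x_1 + x_2) = b_2/(c_3 - c_2)$, and subtracting it from the first gives $h(x_1) = (b_1 - b_2)/(c_2 - c_1)$. The sketch below illustrates this under loud assumptions: the requirement curve $h(x) = 1 - x$ and all cost figures are hypothetical, and `invert_decreasing` is an illustrative bisection helper that presumes $h$ is continuous and decreasing on $[0, 1]$.

```python
def invert_decreasing(h, target, lo=0.0, hi=1.0, iters=100):
    """Bisection for h(x) = target, assuming h is continuous and decreasing on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if h(mid) > target:
            lo = mid   # h is still too large, so the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


def interior_solution(h, b1, b2, c1, c2, c3):
    """Solve the two interior necessary conditions for (x1, x2)."""
    x1 = invert_decreasing(h, (b1 - b2) / (c2 - c1))
    total = invert_decreasing(h, b2 / (c3 - c2))
    return x1, total - x1


def h(x):
    return 1.0 - x  # hypothetical normalized requirement curve


b1, b2, c1, c2, c3 = 3.0, 1.0, 1.0, 5.0, 9.0  # hypothetical cost data
x1, x2 = interior_solution(h, b1, b2, c1, c2, c3)
residual_1 = b1 + (c1 - c2) * h(x1) + (c2 - c3) * h(x1 + x2)
residual_2 = b2 + (c2 - c3) * h(x1 + x2)
print(x1, x2, residual_1, residual_2)
```

With these numbers the computed capacities are approximately $x_1 = 0.5$ and $x_2 = 0.25$, which satisfy $x_1 + x_2 < 1$, so the interior assumption holds for this particular data.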


7.3  Second-Order  Conditions 

The proof of Proposition 1 in Sect. 7.1 is based on making a first-order
approximation to the function $f$ in the neighborhood of the relative minimum point.
Additional conditions can be obtained by considering higher-order approximations.
The second-order conditions, which are defined in terms of the Hessian matrix
$\nabla^2 f$ of second partial derivatives of $f$ (see Appendix A), are of extreme
theoretical importance and dominate much of the analysis presented in later chapters.

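Proposition 1 (Second-Order Necessary Conditions). Let $\Omega$ be a subset of $E^n$ and let
$f \in C^2$ be a function on $\Omega$. If $\mathbf{x}^*$ is a relative minimum point of $f$ over $\Omega$, then for any
$\mathbf{d} \in E^n$ that is a feasible direction at $\mathbf{x}^*$ we have

i) $\nabla f(\mathbf{x}^*)\mathbf{d} \geq 0$ (7.3)

ii) if $\nabla f(\mathbf{x}^*)\mathbf{d} = 0$, then $\mathbf{d}^T\nabla^2 f(\mathbf{x}^*)\mathbf{d} \geq 0$. (7.4)

Proof. The first condition is just Proposition 1 of Sect. 7.1, and the second applies only if
$\nabla f(\mathbf{x}^*)\mathbf{d} = 0$. In this case, introducing $\mathbf{x}(\alpha) = \mathbf{x}^* + \alpha\mathbf{d}$ and $g(\alpha) = f(\mathbf{x}(\alpha))$ as
before, we have, in view of $g'(0) = 0$,

$$g(\alpha) - g(0) = \tfrac{1}{2} g''(0)\alpha^2 + o(\alpha^2).$$

If $g''(0) < 0$ the right side of the above equation is negative for sufficiently small $\alpha$,
which contradicts the relative minimum nature of $g(0)$. Thus

$$g''(0) = \mathbf{d}^T\nabla^2 f(\mathbf{x}^*)\mathbf{d} \geq 0.$$ ∎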

Example 1. For the same problem as Example 2 of Sect. 7.1, we have for $\mathbf{d} =
(d_1, d_2)$

$$\nabla f(\mathbf{x}^*)\mathbf{d} = \tfrac{3}{2} d_2.$$

Thus condition (ii) of Proposition 1 applies only if $d_2 = 0$. In that case we have
$\mathbf{d}^T\nabla^2 f(\mathbf{x}^*)\mathbf{d} = 2d_1^2 \geq 0$, so condition (ii) is satisfied.

Again of special interest is the case where the minimizing point is an interior
point of $\Omega$, as, for example, in the case of completely unconstrained problems.
We then obtain the following classical result.

Proposition 2 (Second-Order Necessary Conditions: Unconstrained Case). Let $\mathbf{x}^*$ be
an interior point of the set $\Omega$, and suppose $\mathbf{x}^*$ is a relative minimum point over $\Omega$ of the
function $f \in C^2$. Then

i) $\nabla f(\mathbf{x}^*) = \mathbf{0}$ (7.5)

ii) for all $\mathbf{d}$, $\mathbf{d}^T\nabla^2 f(\mathbf{x}^*)\mathbf{d} \geq 0$. (7.6)

For notational simplicity we often denote $\nabla^2 f(\mathbf{x})$, the $n \times n$ matrix of the second
partial derivatives of $f$, the Hessian of $f$, by the alternative notation $\mathbf{F}(\mathbf{x})$.
Condition (ii) is equivalent to stating that the matrix $\mathbf{F}(\mathbf{x}^*)$ is positive semidefinite. As
we shall see, the matrix $\mathbf{F}(\mathbf{x}^*)$, which arises here quite naturally in a discussion of
necessary conditions, plays a fundamental role in the analysis of iterative methods
for solving unconstrained optimization problems. The structure of this matrix is the
primary determinant of the rate of convergence of algorithms designed to minimize
the function $f$.



Example 2. Consider the problem

$$\text{minimize} \quad f(x_1, x_2) = x_1^3 - x_1^2 x_2 + 2x_2^2$$
$$\text{subject to} \quad x_1 \geq 0, \ x_2 \geq 0.$$

If we assume that the solution is in the interior of the feasible set, that is, if
$x_1 > 0$, $x_2 > 0$, then the first-order necessary conditions are

$$3x_1^2 - 2x_1 x_2 = 0, \quad -x_1^2 + 4x_2 = 0.$$

There is a solution to these at $x_1 = x_2 = 0$ which is a boundary point, but there is
also a solution at $x_1 = 6$, $x_2 = 9$. We note that for $x_1$ fixed at $x_1 = 6$, the objective
attains a relative minimum with respect to $x_2$ at $x_2 = 9$. Conversely, with $x_2$ fixed
at $x_2 = 9$, the objective attains a relative minimum with respect to $x_1$ at $x_1 = 6$.
Despite this fact, the point $x_1 = 6$, $x_2 = 9$ is not a relative minimum point, because
the Hessian matrix is

$$\mathbf{F} = \begin{bmatrix} 6x_1 - 2x_2 & -2x_1 \\ -2x_1 & 4 \end{bmatrix},$$

which, evaluated at the proposed solution $x_1 = 6$, $x_2 = 9$, is

$$\mathbf{F} = \begin{bmatrix} 18 & -12 \\ -12 & 4 \end{bmatrix}.$$

This matrix is not positive semidefinite, since its determinant is negative. Thus the
proposed solution is not a relative minimum point.
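
The claims in Example 2 are easy to confirm numerically. The sketch below evaluates the gradient and Hessian at $(6, 9)$ and then steps along a direction of negative curvature; the direction $\mathbf{d} = (1, 2)$ and the step length $t = 0.1$ are hand-picked assumptions for illustration, and any $\mathbf{d}$ with $\mathbf{d}^T\mathbf{F}\mathbf{d} < 0$ would serve.

```python
def f(x1, x2):
    return x1 ** 3 - x1 ** 2 * x2 + 2.0 * x2 ** 2


def grad(x1, x2):
    return (3.0 * x1 ** 2 - 2.0 * x1 * x2, -x1 ** 2 + 4.0 * x2)


def hessian(x1, x2):
    return ((6.0 * x1 - 2.0 * x2, -2.0 * x1), (-2.0 * x1, 4.0))


def quad_form(H, d):
    """Return d^T H d for a 2x2 matrix H."""
    return sum(d[i] * H[i][j] * d[j] for i in range(2) for j in range(2))


x = (6.0, 9.0)
H = hessian(*x)
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
d = (1.0, 2.0)   # hand-picked direction of negative curvature
t = 0.1          # small step along d
curvature = quad_form(H, d)
f_here = f(*x)
f_moved = f(x[0] + t * d[0], x[1] + t * d[1])
print(grad(*x), det, curvature, f_here, f_moved)
```

The gradient vanishes, the determinant is negative, and the objective decreases along $\mathbf{d}$, which is consistent with the point being a stationary point that is not a relative minimum.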


Sufficient  Conditions  for  a  Relative  Minimum 

By  slightly  strengthening  the  second  condition  of  Proposition  2  above,  we  obtain  a 
set  of  conditions  that  imply  that  the  point  x*  is  a  relative  minimum.  We  give  here 
the  conditions  that  apply  only  to  unconstrained  problems,  or  to  problems  where  the 
minimum  point  is  interior  to  the  feasible  region,  since  the  corresponding  conditions 
for  problems  where  the  minimum  is  achieved  on  a  boundary  point  of  the  feasible 
set  are  a  good  deal  more  difficult  and  of  marginal  practical  or  theoretical  value. 
A  more  general  result,  applicable  to  problems  with  functional  constraints,  is  given 
in  Chap.  11. 

Proposition 3 (Second-Order Sufficient Conditions: Unconstrained Case). Let $f \in C^2$
be a function defined on a region in which the point $\mathbf{x}^*$ is an interior point. Suppose in addition
that

i) $\nabla f(\mathbf{x}^*) = \mathbf{0}$ (7.7)

ii) $\mathbf{F}(\mathbf{x}^*)$ is positive definite. (7.8)

Then $\mathbf{x}^*$ is a strict relative minimum point of $f$.

Proof. Since $\mathbf{F}(\mathbf{x}^*)$ is positive definite, there is an $a > 0$ such that for all $\mathbf{d}$,
$\mathbf{d}^T\mathbf{F}(\mathbf{x}^*)\mathbf{d} \geq a|\mathbf{d}|^2$. Thus by Taylor's Theorem (with remainder)

$$f(\mathbf{x}^* + \mathbf{d}) - f(\mathbf{x}^*) = \tfrac{1}{2}\mathbf{d}^T\mathbf{F}(\mathbf{x}^*)\mathbf{d} + o(|\mathbf{d}|^2)$$

$$\geq (a/2)|\mathbf{d}|^2 + o(|\mathbf{d}|^2).$$

For small $|\mathbf{d}|$ the first term on the right dominates the second, implying that both
sides are positive for small $\mathbf{d}$. ∎
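
In computations, condition (ii) is commonly tested by attempting a Cholesky factorization, which succeeds with positive pivots exactly when a symmetric matrix is positive definite (in exact arithmetic). The following minimal sketch takes that approach; it is one standard test, swapped in here for illustration, and the function name is an assumption. The two matrices are the constant Hessian of Example 1 of Sect. 7.1 and the Hessian evaluated at $(6, 9)$ in Example 2 of this section.

```python
import math


def is_positive_definite(F):
    """Attempt a Cholesky factorization F = L L^T of a symmetric matrix.
    Returns False as soon as a non-positive pivot appears."""
    n = len(F)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = F[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True


print(is_positive_definite([[2.0, -1.0], [-1.0, 2.0]]))      # True
print(is_positive_definite([[18.0, -12.0], [-12.0, 4.0]]))   # False
```

In floating-point arithmetic a matrix that is only barely positive definite may fail this test, so a borderline result deserves a closer look.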


7.4  Convex  and  Concave  Functions 

In order to develop a theory directed toward characterizing global, rather than local,
minimum points, it is necessary to introduce some sort of convexity assumptions.
This not only results in a more potent, although more restrictive, theory but also
provides an interesting geometric interpretation of the second-order sufficiency result
derived above.

Definition. A function $f$ defined on a convex set $\Omega$ is said to be convex if, for every $\mathbf{x}_1, \mathbf{x}_2 \in \Omega$
and every $\alpha$, $0 \leq \alpha \leq 1$, there holds

$$f(\alpha\mathbf{x}_1 + (1 - \alpha)\mathbf{x}_2) \leq \alpha f(\mathbf{x}_1) + (1 - \alpha)f(\mathbf{x}_2).$$

If, for every $\alpha$, $0 < \alpha < 1$, and $\mathbf{x}_1 \neq \mathbf{x}_2$, there holds

$$f(\alpha\mathbf{x}_1 + (1 - \alpha)\mathbf{x}_2) < \alpha f(\mathbf{x}_1) + (1 - \alpha)f(\mathbf{x}_2),$$

then $f$ is said to be strictly convex.

Several  examples  of  convex  or  nonconvex  functions  are  shown  in  Fig.  7.3. 
Geometrically,  a  function  is  convex  if  the  line  joining  two  points  on  its  graph  lies 
nowhere  below  the  graph,  as  shown  in  Fig.  7.3a,  or,  thinking  of  a  function  in  two 
dimensions,  it  is  convex  if  its  graph  is  bowl  shaped. 
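
The defining inequality lends itself to a simple randomized check for functions of one variable. The sketch below is an illustration only: the function name, the sampling interval, and the tolerance are assumptions, and failing to find a violation is merely evidence of convexity on the sampled interval, not a proof.

```python
import random


def find_convexity_violation(f, lo, hi, trials=10000, seed=0, tol=1e-12):
    """Sample x1, x2 in [lo, hi] and alpha in [0, 1]; return a triple that
    violates f(a*x1 + (1-a)*x2) <= a*f(x1) + (1-a)*f(x2), or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        x1, x2, a = rng.uniform(lo, hi), rng.uniform(lo, hi), rng.random()
        if f(a * x1 + (1 - a) * x2) > a * f(x1) + (1 - a) * f(x2) + tol:
            return x1, x2, a
    return None


square_result = find_convexity_violation(lambda x: x * x, -2.0, 2.0)
cube_result = find_convexity_violation(lambda x: x ** 3, -2.0, 2.0)
print(square_result, cube_result is not None)
```

The square passes every sampled test, while the cube, which is concave for negative arguments, produces a violating triple.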

Next  we  turn  to  the  definition  of  a  concave  function. 

Definition. A function $g$ defined on a convex set $\Omega$ is said to be concave if the function
$f = -g$ is convex. The function $g$ is strictly concave if $-g$ is strictly convex.


Combinations  of  Convex  Functions 

We  show  that  convex  functions  can  be  combined  to  yield  new  convex  functions  and 
that  convex  functions  when  used  as  constraints  yield  convex  constraint  sets. 

Proposition 1. Let $f_1$ and $f_2$ be convex functions on the convex set $\Omega$. Then the function
$f_1 + f_2$ is convex on $\Omega$.

Fig. 7.3 Convex and nonconvex functions

Proof. Let $\mathbf{x}_1, \mathbf{x}_2 \in \Omega$, and $0 < \alpha < 1$. Then

$$f_1(\alpha\mathbf{x}_1 + (1 - \alpha)\mathbf{x}_2) + f_2(\alpha\mathbf{x}_1 + (1 - \alpha)\mathbf{x}_2)$$

$$\leq \alpha[f_1(\mathbf{x}_1) + f_2(\mathbf{x}_1)] + (1 - \alpha)[f_1(\mathbf{x}_2) + f_2(\mathbf{x}_2)].$$ ∎

Proposition 2. Let $f$ be a convex function over the convex set $\Omega$. Then the function $af$ is
convex for any $a \geq 0$.

Proof. Immediate. ∎

Note that through repeated application of the above two propositions it follows
that a positive combination $a_1 f_1 + a_2 f_2 + \ldots + a_m f_m$ of convex functions is again
convex.

Finally,  we  consider  sets  defined  by  convex  inequality  constraints. 

Proposition 3. Let $f$ be a convex function on a convex set $\Omega$. The set $\Gamma_c = \{\mathbf{x} : \mathbf{x} \in
\Omega, f(\mathbf{x}) \leq c\}$ is convex for every real number $c$.

Proof. Let $\mathbf{x}_1, \mathbf{x}_2 \in \Gamma_c$. Then $f(\mathbf{x}_1) \leq c$, $f(\mathbf{x}_2) \leq c$ and for $0 < \alpha < 1$,

$$f(\alpha\mathbf{x}_1 + (1 - \alpha)\mathbf{x}_2) \leq \alpha f(\mathbf{x}_1) + (1 - \alpha)f(\mathbf{x}_2) \leq c.$$

Thus $\alpha\mathbf{x}_1 + (1 - \alpha)\mathbf{x}_2 \in \Gamma_c$. ∎

We note that, since the intersection of convex sets is also convex, the set of points
simultaneously satisfying

$$f_1(\mathbf{x}) \leq c_1, \ f_2(\mathbf{x}) \leq c_2, \ \ldots, \ f_m(\mathbf{x}) \leq c_m,$$

where each $f_i$ is a convex function, defines a convex set. This is important in
mathematical programming, since the constraint set is often defined this way.


Properties  of  Differentiable  Convex  Functions 

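If a function $f$ is differentiable, then there are alternative characterizations of
convexity.

Proposition 4. Let $f \in C^1$. Then $f$ is convex over a convex set $\Omega$ if and only if

$$f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})(\mathbf{y} - \mathbf{x}) \tag{7.9}$$

for all $\mathbf{x}, \mathbf{y} \in \Omega$.

Proof. First suppose $f$ is convex. Then for all $\alpha$, $0 \leq \alpha \leq 1$,

$$f(\alpha\mathbf{y} + (1 - \alpha)\mathbf{x}) \leq \alpha f(\mathbf{y}) + (1 - \alpha)f(\mathbf{x}).$$

Thus for $0 < \alpha \leq 1$

$$\frac{f(\mathbf{x} + \alpha(\mathbf{y} - \mathbf{x})) - f(\mathbf{x})}{\alpha} \leq f(\mathbf{y}) - f(\mathbf{x}).$$

Letting $\alpha \to 0$ we obtain

$$\nabla f(\mathbf{x})(\mathbf{y} - \mathbf{x}) \leq f(\mathbf{y}) - f(\mathbf{x}).$$

This proves the "only if" part.

Now assume

$$f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})(\mathbf{y} - \mathbf{x})$$

for all $\mathbf{x}, \mathbf{y} \in \Omega$. Fix $\mathbf{x}_1, \mathbf{x}_2 \in \Omega$ and $\alpha$, $0 \leq \alpha \leq 1$. Setting $\mathbf{x} = \alpha\mathbf{x}_1 + (1 - \alpha)\mathbf{x}_2$ and
alternatively $\mathbf{y} = \mathbf{x}_1$ or $\mathbf{y} = \mathbf{x}_2$, we have

$$f(\mathbf{x}_1) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})(\mathbf{x}_1 - \mathbf{x}) \tag{7.10}$$

$$f(\mathbf{x}_2) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})(\mathbf{x}_2 - \mathbf{x}). \tag{7.11}$$

Multiplying (7.10) by $\alpha$ and (7.11) by $(1 - \alpha)$ and adding, we obtain

$$\alpha f(\mathbf{x}_1) + (1 - \alpha)f(\mathbf{x}_2) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})[\alpha\mathbf{x}_1 + (1 - \alpha)\mathbf{x}_2 - \mathbf{x}].$$

But substituting $\mathbf{x} = \alpha\mathbf{x}_1 + (1 - \alpha)\mathbf{x}_2$, we obtain

$$\alpha f(\mathbf{x}_1) + (1 - \alpha)f(\mathbf{x}_2) \geq f(\alpha\mathbf{x}_1 + (1 - \alpha)\mathbf{x}_2).$$ ∎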
The  statement  of  the  above  proposition  is  illustrated  in  Fig.  7.4.  It  can  be  regarded 
as  a  sort  of  dual  characterization  of  the  original  definition  illustrated  in  Fig.  7.3. 
The  original  definition  essentially  states  that  linear  interpolation  between  two  points 
overestimates  the  function,  while  the  above  proposition  states  that  linear  approxima¬ 
tion  based  on  the  local  derivative  underestimates  the  function. 
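
The underestimation property can be observed directly for a convex quadratic. The sketch below reuses the objective of Example 1 of Sect. 7.1 and evaluates, over a small grid, the gap between the function and its linear approximation; the grid and the helper name `tangent_gap` are illustrative assumptions.

```python
def f(x):
    return x[0] ** 2 - x[0] * x[1] + x[1] ** 2 - 3.0 * x[1]


def grad(x):
    return (2.0 * x[0] - x[1], -x[0] + 2.0 * x[1] - 3.0)


def tangent_gap(x, y):
    """f(y) minus the linear approximation of f built at x."""
    g = grad(x)
    return f(y) - (f(x) + g[0] * (y[0] - x[0]) + g[1] * (y[1] - x[1]))


grid = [(i / 2.0, j / 2.0) for i in range(-4, 5) for j in range(-4, 5)]
smallest = min(tangent_gap(x, y) for x in grid for y in grid)
print(smallest)
```

For this quadratic the gap equals $d_1^2 - d_1 d_2 + d_2^2$ with $\mathbf{d} = \mathbf{y} - \mathbf{x}$, which is nonnegative and vanishes only when $\mathbf{y} = \mathbf{x}$, so the smallest gap on the grid is zero.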

For  twice  continuously  differentiable  functions,  there  is  another  characterization 
of  convexity. 


Fig. 7.4 Illustration of Proposition 4

Proposition 5. Let $f \in C^2$. Then $f$ is convex over a convex set $\Omega$ containing an interior
point if and only if the Hessian matrix $\mathbf{F}$ of $f$ is positive semidefinite throughout $\Omega$.

Proof. By Taylor's theorem we have

$$f(\mathbf{y}) = f(\mathbf{x}) + \nabla f(\mathbf{x})(\mathbf{y} - \mathbf{x}) + \tfrac{1}{2}(\mathbf{y} - \mathbf{x})^T\mathbf{F}(\mathbf{x} + \alpha(\mathbf{y} - \mathbf{x}))(\mathbf{y} - \mathbf{x}) \tag{7.12}$$

for some $\alpha$, $0 \leq \alpha \leq 1$. Clearly, if the Hessian is everywhere positive semidefinite,
we have

$$f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})(\mathbf{y} - \mathbf{x}), \tag{7.13}$$

which in view of Proposition 4 implies that $f$ is convex.

Now suppose the Hessian is not positive semidefinite at some point $\mathbf{x} \in \Omega$.
By continuity of the Hessian it can be assumed, without loss of generality, that $\mathbf{x}$
is an interior point of $\Omega$. There is a $\mathbf{y} \in \Omega$ such that $(\mathbf{y} - \mathbf{x})^T\mathbf{F}(\mathbf{x})(\mathbf{y} - \mathbf{x}) < 0$. Again
by the continuity of the Hessian, $\mathbf{y}$ may be selected so that for all $\alpha$, $0 \leq \alpha \leq 1$,

$$(\mathbf{y} - \mathbf{x})^T\mathbf{F}(\mathbf{x} + \alpha(\mathbf{y} - \mathbf{x}))(\mathbf{y} - \mathbf{x}) < 0.$$

This in view of (7.12) implies that (7.13) does not hold, which in view of
Proposition 4 implies that $f$ is not convex. ∎

The Hessian matrix is the generalization to $E^n$ of the concept of the curvature of a
function, and correspondingly, positive definiteness of the Hessian is the
generalization of positive curvature. Convex functions have positive (or at least nonnegative)
curvature in every direction. Motivated by these observations, we sometimes refer
to a function as being locally convex if its Hessian matrix is positive semidefinite
in a small region, and locally strictly convex if the Hessian is positive definite in
the region. In these terms we see that the second-order sufficiency result of the last
section requires that the function be locally strictly convex at the point $\mathbf{x}^*$. Thus,
even the local theory, derived solely in terms of the elementary calculus, is actually
intimately related to convexity, at least locally. For this reason we can view the two
theories, local and global, not as disjoint parallel developments but as
complementary and interactive. Results that are based on convexity apply even to nonconvex
problems in a region near the solution, and conversely, local results apply to a global
minimum point.


7.5  Minimization  and  Maximization  of  Convex  Functions 

We  turn  now  to  the  three  classic  results  concerning  minimization  or  maximization 
of  convex  functions. 


Theorem 1. Let $f$ be a convex function defined on the convex set $\Omega$. Then the set $\Gamma$ where $f$
achieves its minimum is convex, and any relative minimum of $f$ is a global minimum.

Proof. If $f$ has no relative minima the theorem is valid by default. Assume now that
$c_0$ is the minimum of $f$. Then clearly $\Gamma = \{\mathbf{x} : f(\mathbf{x}) \leq c_0, \mathbf{x} \in \Omega\}$ and this is convex
by Proposition 3 of the last section.

Suppose now that $\mathbf{x}^* \in \Omega$ is a relative minimum point of $f$, but that there is
another point $\mathbf{y} \in \Omega$ with $f(\mathbf{y}) < f(\mathbf{x}^*)$. On the line $\alpha\mathbf{y} + (1 - \alpha)\mathbf{x}^*$, $0 < \alpha < 1$ we
have

$$f(\alpha\mathbf{y} + (1 - \alpha)\mathbf{x}^*) \leq \alpha f(\mathbf{y}) + (1 - \alpha)f(\mathbf{x}^*) < f(\mathbf{x}^*),$$

contradicting the fact that $\mathbf{x}^*$ is a relative minimum point. ∎

We might paraphrase the above theorem as saying that for convex functions, all
minimum points are located together (in a convex set) and all relative minima are
global minima. The next theorem says that if $f$ is continuously differentiable and
convex, then satisfaction of the first-order necessary conditions is both necessary
and sufficient for a point to be a global minimizing point.

Theorem 2. Let $f \in C^1$ be convex on the convex set $\Omega$. If there is a point $\mathbf{x}^* \in \Omega$ such that,
for all $\mathbf{y} \in \Omega$, $\nabla f(\mathbf{x}^*)(\mathbf{y} - \mathbf{x}^*) \geq 0$, then $\mathbf{x}^*$ is a global minimum point of $f$ over $\Omega$.

Proof. We note parenthetically that since $\mathbf{y} - \mathbf{x}^*$ is a feasible direction at $\mathbf{x}^*$, the given
condition is equivalent to the first-order necessary condition stated in Sect. 7.1. The
proof of the theorem is immediate, since by Proposition 4 of the last section

$$f(\mathbf{y}) \geq f(\mathbf{x}^*) + \nabla f(\mathbf{x}^*)(\mathbf{y} - \mathbf{x}^*) \geq f(\mathbf{x}^*).$$ ∎

Next  we  turn  to  the  question  of  maximizing  a  convex  function  over  a  convex  set. 
There  is,  however,  no  analog  of  Theorem  1  for  maximization;  indeed,  the  tendency 
is  for  the  occurrence  of  numerous  nonglobal  relative  maximum  points.  Nevertheless, 
it  is  possible  to  prove  one  important  result.  It  is  not  used  in  subsequent  chapters, 
but  it  is  useful  for  some  areas  of  optimization. 

Theorem  3.  Let  f  be  a  convex  function  defined  on  the  bounded,  closed  convex  set  LI.  If  f 

has  a  maximum  over  LI  it  is  achieved  at  an  extreme  point  of  LI. 

Proof.  Suppose  /  achieves  a  global  maximum  at  x*  e  Q.  We  show  first  that  this 
maximum  is  achieved  at  some  boundary  point  of  LI.  If  x*  is  itself  a  boundary  point, 
then  there  is  nothing  to  prove,  so  assume  x*  is  not  a  boundary  point.  Let  L  be  any 
line  passing  through  the  point  x*.  The  intersection  of  this  line  with  Q  is  an  interval 
of  the  line  L  having  end  points  yi,  y2  which  are  boundary  points  of  LI,  and  we  have 
x*  =  ay  i  +  (1  -  a)y2  for  some  a,  0  <  a  <  1.  By  convexity  of  / 

f(x*)  <  aft  yi)  +  (1  -  a)/(y2)  <  max{/(yi),  /( y2)}. 

Thus  either  /(yi)  or  /(y2)  must  be  at  least  as  great  as  /(x*).  Since  x*  is  a  maximum 
point,  so  is  either  yi  or  y2. 

We have shown that the maximum, if achieved, must be achieved at a boundary
point of $\Omega$. If this boundary point, $\mathbf{x}^*$, is an extreme point of $\Omega$ there is nothing
more to prove. If it is not an extreme point, consider the intersection of $\Omega$ with a
supporting hyperplane $H$ at $\mathbf{x}^*$. This intersection, $T_1$, is of dimension $n - 1$ or less
and the global maximum of $f$ over $T_1$ is equal to $f(\mathbf{x}^*)$ and must be achieved at a
boundary point $\mathbf{x}_1$ of $T_1$. If this boundary point is an extreme point of $T_1$, it is also an
extreme point of $\Omega$ by Lemma 1, Sect. B.4, and hence the theorem is proved. If $\mathbf{x}_1$ is
not an extreme point of $T_1$, we form $T_2$, the intersection of $T_1$ with a hyperplane in
$E^{n-1}$ supporting $T_1$ at $\mathbf{x}_1$. This process can continue at most a total of $n$ times when a
set $T_n$ of dimension zero, consisting of a single point, is obtained. This single point
is an extreme point of $T_n$ and also, by repeated application of Lemma 1, Sect. B.4,
an extreme point of $\Omega$. ∎
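
The theorem can be checked numerically on a small case. The following is a minimal sketch, not part of the original development: the convex function `f`, the unit square standing in for $\Omega$, the point `CENTER`, and the sample count are all illustrative assumptions. It compares the largest value of `f` over the four extreme points of the square with the largest value found by random sampling of the square.

```python
import itertools
import random

CENTER = (0.3, 0.6)  # illustrative choice; any point gives a convex f


def f(x):
    """Convex function on E^2: squared distance from CENTER."""
    return (x[0] - CENTER[0]) ** 2 + (x[1] - CENTER[1]) ** 2


# Extreme points of the unit square, which plays the role of the set Omega.
VERTICES = list(itertools.product([0.0, 1.0], repeat=2))


def best_vertex():
    """Return the extreme point with the largest value of f."""
    return max(VERTICES, key=f)


def best_sample(num_samples=10_000, seed=0):
    """Return the largest value of f over random points of the square."""
    rng = random.Random(seed)
    return max(f((rng.random(), rng.random())) for _ in range(num_samples))


if __name__ == "__main__":
    v = best_vertex()
    print("best vertex:", v, "f =", f(v))
    print("best sampled value:", best_sample())
```

Random sampling never exceeds the best vertex value, as the theorem predicts. Sampling is only a sanity check here, not a proof.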


7.6 * Zero-Order Conditions

We have considered the problem

\[
\begin{array}{ll}
\text{minimize} & f(\mathbf{x}) \\
\text{subject to} & \mathbf{x} \in \Omega
\end{array}
\tag{7.14}
\]

to be unconstrained because there are no functional constraints of the form $\mathbf{g}(\mathbf{x}) \le \mathbf{b}$
or $\mathbf{h}(\mathbf{x}) = \mathbf{c}$. However, the problem is of course constrained by the set $\Omega$. This
constraint influences the first- and second-order necessary and sufficient conditions
through the relation between feasible directions and derivatives of the function $f$.
Nevertheless, there is a way to treat this constraint without reference to derivatives.
The resulting conditions are then of zero order. These necessary conditions require
that the problem be convex in a certain way, while the sufficient conditions require
no assumptions at all. The simplest assumptions for the necessary conditions are that
$\Omega$ is a convex set and that $f$ is a convex function on all of $E^n$.


Fig. 7.5 The epigraph, the tubular region, and the hyperplane

To derive the necessary conditions under these assumptions consider the set
$T = \{(r, \mathbf{x}) : r \ge f(\mathbf{x}),\ \mathbf{x} \in E^n\} \subset E^{n+1}$. In a figure of the graph of $f$, the set $T$ is the
region above the graph, shown in the upper part of Fig. 7.5. This set is called the
epigraph of $f$. It is easy to verify that the set $T$ is convex if $f$ is a convex function.

Suppose that $\mathbf{x}^* \in \Omega$ is the minimizing point with value $f^* = f(\mathbf{x}^*)$. We construct
a tubular region with cross section $\Omega$ and extending vertically from $-\infty$ up to $f^*$,
shown as $B$ in the upper part of Fig. 7.5. This is also a convex set, and it overlaps
the set $T$ only at the boundary point $(f^*, \mathbf{x}^*)$ above $\mathbf{x}^*$ (or possibly many boundary
points if $f$ is flat near $\mathbf{x}^*$).

According to the separating hyperplane theorem (Appendix B), there is a hyperplane
separating these two sets. This hyperplane can be represented by a nonzero
vector of the form $(s, \boldsymbol{\lambda}) \in E^{n+1}$ with $s$ a scalar and $\boldsymbol{\lambda} \in E^n$, and a separation constant
$c$. The separation conditions are

\[
s r + \boldsymbol{\lambda}^T \mathbf{x} \ge c \quad \text{for all } \mathbf{x} \in E^n \text{ and } r \ge f(\mathbf{x}) \tag{7.15}
\]

\[
s r + \boldsymbol{\lambda}^T \mathbf{x} \le c \quad \text{for all } \mathbf{x} \in \Omega \text{ and } r \le f^*. \tag{7.16}
\]

It follows that $s \ne 0$; for otherwise $\boldsymbol{\lambda} \ne \mathbf{0}$ and then (7.15) would be violated for some
$\mathbf{x} \in E^n$. It also follows that $s \ge 0$ since otherwise (7.16) would be violated by very
negative values of $r$. Hence, together we find $s > 0$ and by appropriate scaling we
may take $s = 1$.

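It is easy to see that the above conditions can be expressed alternatively as two
optimization problems, as stated in the following proposition.

Proposition 1 (Zero-Order Necessary Conditions). If $\mathbf{x}^*$ solves (7.14) under the stated
convexity conditions, then there is a vector $\boldsymbol{\lambda} \in E^n$ such that $\mathbf{x}^*$ is a solution to the
two problems:

\[
\begin{array}{ll}
\text{minimize} & f(\mathbf{x}) + \boldsymbol{\lambda}^T \mathbf{x} \\
\text{subject to} & \mathbf{x} \in E^n
\end{array}
\tag{7.17}
\]

and

\[
\begin{array}{ll}
\text{maximize} & \boldsymbol{\lambda}^T \mathbf{x} \\
\text{subject to} & \mathbf{x} \in \Omega.
\end{array}
\tag{7.18}
\]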

Proof. Problem (7.17) follows from (7.15) (with $s = 1$) and the fact that $f(\mathbf{x}) \le r$
for $r \ge f(\mathbf{x})$. The value $c$ is attained from above at $(f^*, \mathbf{x}^*)$. Likewise (7.18) follows
from (7.16) and the fact that $\mathbf{x}^*$ and the appropriate $r$ attain $c$ from below. ∎

Notice that problem (7.17) is completely unconstrained, since $\mathbf{x}$ may range over
all of $E^n$. The second problem (7.18) is constrained by $\Omega$ but has a linear objective
function. It is clear from Fig. 7.5 that the slope of the hyperplane is equal to the
slope of the function $f$ when $f$ is continuously differentiable at the solution $\mathbf{x}^*$.

If the optimal solution $\mathbf{x}^*$ is in the interior of $\Omega$, then the second problem (7.18)
implies that $\boldsymbol{\lambda} = \mathbf{0}$, for otherwise there would be a direction of movement from $\mathbf{x}^*$
that increases the product $\boldsymbol{\lambda}^T \mathbf{x}$ above $\boldsymbol{\lambda}^T \mathbf{x}^*$. The hyperplane is horizontal in that case.
The zero-order conditions provide no new information in this situation. However,
when the solution is on a boundary point of $\Omega$ the conditions give very useful information.

Example 1 (Minimization Over an Interval). Consider a continuously differentiable
function $f$ of a single variable $x \in E^1$ defined on the unit interval $[0, 1]$ which plays
the role of $\Omega$ here. The first problem (7.17) implies $f'(x^*) = -\lambda$. If the solution is
at the left end of the interval (at $x = 0$) then the second problem (7.18) implies that
$\lambda \le 0$ which means that $f'(x^*) \ge 0$. The reverse holds if $x^*$ is at the right end. These
together are identical to the first-order conditions of Sect. 7.1.
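
The interval example can be checked numerically. The sketch below is illustrative only: the objective $f(x) = (x + 1)^2$, the grid resolution, and the finite window $[-5, 5]$ used to test the unconstrained problem are assumptions made here, not taken from the text. For this choice the minimizer over $[0, 1]$ is $x^* = 0$, and (7.17) forces $\lambda = -f'(0) = -2$.

```python
def f(x):
    """Illustrative objective; its minimum over [0, 1] is at x = 0."""
    return (x + 1.0) ** 2


def f_prime(x):
    return 2.0 * (x + 1.0)


def grid(lo, hi, n=2001):
    """Evenly spaced test points on [lo, hi]."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]


def solves_7_17(x, lam, lo=-5.0, hi=5.0, tol=1e-12):
    """Check on a grid that x minimizes f(z) + lam * z (the window is an assumption)."""
    target = f(x) + lam * x
    return all(f(z) + lam * z >= target - tol for z in grid(lo, hi))


def solves_7_18(x, lam, tol=1e-12):
    """Check on a grid that x maximizes lam * z over the interval [0, 1]."""
    target = lam * x
    return all(lam * z <= target + tol for z in grid(0.0, 1.0))


x_star = 0.0
lam = -f_prime(x_star)  # (7.17) requires f'(x*) = -lambda

if __name__ == "__main__":
    print("lambda =", lam)
    print("(7.17) holds:", solves_7_17(x_star, lam))
    print("(7.18) holds:", solves_7_18(x_star, lam))
```

With $\lambda = -2$ the first problem reduces to minimizing $x^2 + 1$, and the second to maximizing $-2x$ on $[0, 1]$. Both are solved by $x = 0$, whereas the right end point $x = 1$ fails both checks.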

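Example 2. As a generalization of the above example, let $f \in C^1$ on $E^n$, and let $f$
have a minimum with respect to $\Omega$ at $\mathbf{x}^*$. Let $\mathbf{d} \in E^n$ be a feasible direction at $\mathbf{x}^*$.
Then (7.17) gives $\nabla f(\mathbf{x}^*) = -\boldsymbol{\lambda}^T$, and (7.18) gives $\boldsymbol{\lambda}^T \mathbf{d} \le 0$ because moving from $\mathbf{x}^*$
along a feasible direction cannot increase $\boldsymbol{\lambda}^T \mathbf{x}$. Together these yield $\nabla f(\mathbf{x}^*)\mathbf{d} \ge 0$.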

Sufficient  Conditions  Theorem.  The  conditions  of  Proposition  1  are  sufficient  for  x*  to  be 
a  minimum  even  without  the  convexity  assumptions. 

Proposition 2 (Zero-Order Sufficiency Conditions). If there is a $\boldsymbol{\lambda}$ such that $\mathbf{x}^* \in \Omega$ solves
the problems (7.17) and (7.18), then $\mathbf{x}^*$ solves (7.14).

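Proof. Suppose $\mathbf{x}_1$ is any other point in $\Omega$. Then from (7.17)

\[
f(\mathbf{x}_1) + \boldsymbol{\lambda}^T \mathbf{x}_1 \ge f(\mathbf{x}^*) + \boldsymbol{\lambda}^T \mathbf{x}^*.
\]

This can be rewritten as

\[
f(\mathbf{x}_1) - f(\mathbf{x}^*) \ge \boldsymbol{\lambda}^T \mathbf{x}^* - \boldsymbol{\lambda}^T \mathbf{x}_1.
\]

By problem (7.18) the right hand side of this is greater than or equal to zero. Hence
$f(\mathbf{x}_1) - f(\mathbf{x}^*) \ge 0$ which establishes the result. ∎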


7.7  Global  Convergence  of  Descent  Algorithms 

A  good  portion  of  the  remainder  of  this  book  is  devoted  to  presentation  and  analysis 
of  various  algorithms  designed  to  solve  nonlinear  programming  problems.  Although 
these  algorithms  vary  substantially  in  their  motivation,  application,  and  detailed 
analysis,  ranging  from  the  simple  to  the  highly  complex,  they  have  the  common 
heritage of all being iterative descent algorithms. By iterative, we mean, roughly,
that the algorithm generates a sequence of points, each point being calculated on the
basis of the points preceding it. By descent, we mean that as each new point is
generated  by  the  algorithm  the  corresponding  value  of  some  function  (evaluated  at 
the  most  recent  point)  decreases  in  value.  Ideally,  the  sequence  of  points  generated 
by  the  algorithm  in  this  way  converges  in  a  finite  or  infinite  number  of  steps  to  a 
solution of the original problem.

An iterative algorithm is initiated by specifying a starting point. If for arbitrary
starting points the algorithm is guaranteed to generate a sequence of points converging
to a solution, then the algorithm is said to be globally convergent. Quite
definitely,  not  all  algorithms  have  this  obviously  desirable  property.  Indeed,  many  of 
the  most  important  algorithms  for  solving  nonlinear  programming  problems  are  not 
globally  convergent  in  their  purest  form  and  thus  occasionally  generate  sequences 
that  either  do  not  converge  at  all  or  converge  to  points  that  are  not  solutions.  It  is 
often  possible,  however,  to  modify  such  algorithms,  by  appending  special  devices, 
so  as  to  guarantee  global  convergence. 

Fortunately,  the  subject  of  global  convergence  can  be  treated  in  a  unified  manner 
through the analysis of a general theory of algorithms developed mainly by Zangwill.
From this analysis, which is presented in this section, we derive the Global
Convergence Theorem that is applicable to the study of any iterative descent algorithm.
Frequent reference to this important result is made in subsequent chapters.


Iterative  Algorithms 

We think of an algorithm as a mapping. Given a point $\mathbf{x}$ in some space $X$, the output
of an algorithm applied to $\mathbf{x}$ is a new point. Operated iteratively, an algorithm is
repeatedly reapplied to the new points it generates so as to produce a whole sequence
of points. Thus, as a preliminary definition, we might formally define an algorithm $\mathbf{A}$
as a mapping taking points in a space $X$ into (other) points in $X$. Operated iteratively,
the algorithm $\mathbf{A}$ initiated at $\mathbf{x}_0 \in X$ would generate the sequence $\{\mathbf{x}_k\}$ defined by

\[
\mathbf{x}_{k+1} = \mathbf{A}(\mathbf{x}_k).
\]


In  practice,  the  mapping  A  might  be  defined  explicitly  by  a  simple  mathematical 
expression  or  it  might  be  defined  implicitly  by,  say,  a  lengthy  complex  computer 
program.  Given  an  input  vector,  both  define  a  corresponding  output. 

With  this  intuitive  idea  of  an  algorithm  in  mind,  we  now  generalize  the  concept 
somewhat  so  as  to  provide  greater  flexibility  in  our  analyses. 

Definition. An algorithm $\mathbf{A}$ is a mapping defined on a space $X$ that assigns to every point
$\mathbf{x} \in X$ a subset of $X$.

In this definition the term "space" can be interpreted loosely. Usually $X$ is the
vector space $E^n$ but it may be only a subset of $E^n$ or even a more general metric
space. The most important aspect of the definition, however, is that the mapping $\mathbf{A}$,
rather than being a point-to-point mapping of $X$, is a point-to-set mapping of $X$.

An algorithm $\mathbf{A}$ generates a sequence of points in the following way. Given
$\mathbf{x}_k \in X$ the algorithm yields $\mathbf{A}(\mathbf{x}_k)$ which is a subset of $X$. From this subset an
arbitrary element $\mathbf{x}_{k+1}$ is selected. In this way, given an initial point $\mathbf{x}_0$, the algorithm
generates sequences through the iteration

\[
\mathbf{x}_{k+1} \in \mathbf{A}(\mathbf{x}_k).
\]

It is clear that, unlike the case where $\mathbf{A}$ is a point-to-point mapping, the sequence
generated by the algorithm $\mathbf{A}$ cannot, in general, be predicted solely from knowledge
of the initial point $\mathbf{x}_0$. This degree of uncertainty is designed to reflect uncertainty
that we may have in practice as to specific details of an algorithm.

Example 1. Suppose for $x$ on the real line we define

\[
A(x) = [-|x|/2,\ |x|/2]
\]

so that $A(x)$ is an interval of the real line. Starting at $x_0 = 100$, each of the sequences
below might be generated from iterative application of this algorithm.

100, 50, 25, 12, -6, -2, 1, 1/2, ...

100, -40, 20, -5, -2, 1, 1/4, 1/8, ...

100, 10, -1, 1/16, 1/100, -1/1000, 1/10 000, ...

The  apparent  ambiguity  that  is  built  into  this  definition  of  an  algorithm  is  not  meant 
to  imply  that  actual  algorithms  are  random  in  character.  In  actual  implementation 
algorithms  are  not  defined  ambiguously.  Indeed,  a  particular  computer  program 
executed  twice  from  the  same  starting  point  will  generate  two  copies  of  the  same 
sequence.  In  other  words,  in  practice  algorithms  are  point-to-point  mappings.  The 
utility  of  the  more  general  definition  is  that  it  allows  one  to  analyze,  in  a  single  step, 
the convergence of an infinite family of similar algorithms. Thus, two computer programs,
designed from the same basic idea, may differ slightly in some details, and
therefore perhaps may not produce identical results when given the same starting
point. Both programs may, however, be regarded as implementations of the same
point-to-set mapping. In the example above, for instance, it is not necessary to
know exactly how $x_{k+1}$ is determined from $x_k$ so long as it is known that its absolute
value is no greater than one-half of the absolute value of $x_k$. The result will always tend
toward zero. In this manner, the generalized concept of an algorithm sometimes leads
to simpler analysis.
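
The point about a family of implementations sharing one point-to-set mapping can be made concrete. In the minimal sketch below, the three selection rules and the names `generate` and `rules` are hypothetical illustrations. Each rule is a different point-to-point realization of the same mapping $A(x) = [-|x|/2, |x|/2]$ from the example above, so one bound covers all of them.

```python
import random


def A(x):
    """Return the interval [-|x|/2, |x|/2] as a (low, high) pair."""
    half = abs(x) / 2.0
    return (-half, half)


def generate(x0, steps, select):
    """Generate x_0, ..., x_steps, picking x_{k+1} from A(x_k) with the rule `select`."""
    xs = [x0]
    for _ in range(steps):
        low, high = A(xs[-1])
        xs.append(select(low, high))
    return xs


rng = random.Random(1)

# Three hypothetical "implementations" of the same point-to-set mapping.
rules = {
    "right endpoint": lambda low, high: high,
    "left endpoint": lambda low, high: low,
    "random point": lambda low, high: rng.uniform(low, high),
}

sequences = {name: generate(100.0, 20, rule) for name, rule in rules.items()}

if __name__ == "__main__":
    for name, xs in sequences.items():
        print(name, xs[:5], "... final:", xs[-1])
```

Whatever rule is used, $|x_k| \le 100/2^k$, so every such sequence tends to zero without any knowledge of how the individual points were selected.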


Descent 

In order to describe the idea of a descent algorithm we first must agree on a subset
$T$ of the space $X$, referred to as the solution set. The basic idea of a descent function,
which  is  defined  below,  is  that  for  points  outside  the  solution  set,  a  single  step  of  the 
algorithm  yields  a  decrease  in  the  value  of  the  descent  function. 

Definition. Let $T \subset X$ be a given solution set and let $\mathbf{A}$ be an algorithm on $X$. A continuous
real-valued function $Z$ on $X$ is said to be a descent function for $T$ and $\mathbf{A}$ if it satisfies

i) if $\mathbf{x} \notin T$ and $\mathbf{y} \in \mathbf{A}(\mathbf{x})$, then $Z(\mathbf{y}) < Z(\mathbf{x})$

ii) if $\mathbf{x} \in T$ and $\mathbf{y} \in \mathbf{A}(\mathbf{x})$, then $Z(\mathbf{y}) \le Z(\mathbf{x})$.

There are a number of ways a solution set, algorithm, and descent function can
be defined. A natural set-up for the problem

\[
\begin{array}{ll}
\text{minimize} & f(\mathbf{x}) \\
\text{subject to} & \mathbf{x} \in \Omega
\end{array}
\tag{7.19}
\]

is to let $T$ be the set of minimizing points, and define an algorithm $\mathbf{A}$ on $\Omega$ in such a
way that $f$ decreases at each step and thereby serves as a descent function. Indeed,
this is the procedure followed in a majority of cases. Another possibility for unconstrained
problems is to let $T$ be the set of points $\mathbf{x}$ satisfying $\nabla f(\mathbf{x}) = \mathbf{0}$. In this case
we might design an algorithm for which $|\nabla f(\mathbf{x})|$ serves as a descent function or for
which $f(\mathbf{x})$ serves as a descent function.


*  Closed  Mappings 

An  important  property  possessed  by  some  algorithms  is  that  they  are  closed.  This 
property, which is a generalization for point-to-set mappings of the concept of continuity
for point-to-point mappings, turns out to be the key to establishing a general
global convergence theorem. In defining this property we allow the point-to-set
mapping  to  map  points  in  one  space  X  into  subsets  of  another  space  Y. 

Definition. A point-to-set mapping $\mathbf{A}$ from $X$ to $Y$ is said to be closed at $\mathbf{x} \in X$ if the
assumptions

i) $\mathbf{x}_k \to \mathbf{x}$, $\mathbf{x}_k \in X$,

ii) $\mathbf{y}_k \to \mathbf{y}$, $\mathbf{y}_k \in \mathbf{A}(\mathbf{x}_k)$

imply

iii) $\mathbf{y} \in \mathbf{A}(\mathbf{x})$.

The point-to-set map $\mathbf{A}$ is said to be closed on $X$ if it is closed at each point of $X$.


Fig. 7.6 Graphs of mappings


Example 2. As a special case, suppose that the mapping $\mathbf{A}$ is a point-to-point mapping;
that is, for each $\mathbf{x} \in X$ the set $\mathbf{A}(\mathbf{x})$ consists of a single point in $Y$. Suppose also
that $\mathbf{A}$ is continuous at $\mathbf{x} \in X$. This means that if $\mathbf{x}_k \to \mathbf{x}$ then $\mathbf{A}(\mathbf{x}_k) \to \mathbf{A}(\mathbf{x})$, and
it follows that $\mathbf{A}$ is closed at $\mathbf{x}$. Thus for point-to-point mappings continuity implies
closedness. The converse is, however, not true in general.
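
To see why the converse can fail, one standard illustration is the following point-to-point mapping on the real line. It is added here as an example and is not taken from the text.

```latex
\[
A(x) =
\begin{cases}
1/x, & x \neq 0, \\
0,   & x = 0.
\end{cases}
\]
```

This mapping is not continuous at $x = 0$, since $A(x_k)$ is unbounded when $x_k \to 0$ with $x_k \neq 0$. It is nevertheless closed at $x = 0$. Suppose $x_k \to 0$ and $y_k = A(x_k) \to y$. Then $x_k \neq 0$ can hold for only finitely many $k$, since otherwise $\{y_k\}$ would be unbounded. Hence $y_k = 0$ for all large $k$, and so $y = 0 = A(0)$.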

The  definition  of  a  closed  mapping  can  be  visualized  in  terms  of  the  graph  of  the 
mapping, which is the set $\{(\mathbf{x}, \mathbf{y}) : \mathbf{x} \in X,\ \mathbf{y} \in \mathbf{A}(\mathbf{x})\}$. If $X$ is closed, then $\mathbf{A}$ is closed
throughout  X  if  and  only  if  this  graph  is  a  closed  set.  This  is  illustrated  in  Fig.  7.6. 
However,  this  equivalence  is  valid  only  when  considering  closedness  everywhere. 
In  general  a  mapping  may  be  closed  at  some  points  and  not  at  others. 


Example  3.  The  reader  should  verify  that  the  point-to-set  mapping  defined  in 
Example  1  is  closed. 

Many  complex  algorithms  that  we  analyze  are  most  conveniently  regarded  as  the 
composition  of  two  or  more  simple  point-to-set  mappings.  It  is  therefore  natural  to 
ask  whether  closedness  of  the  individual  maps  implies  closedness  of  the  composite. 
The  answer  is  a  qualified  “yes.”  The  technical  details  of  composition  are  described 
in  the  remainder  of  this  subsection.  They  can  safely  be  omitted  at  first  reading  while 
proceeding  to  the  Global  Convergence  Theorem. 


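Definition. Let $\mathbf{A} : X \to Y$ and $\mathbf{B} : Y \to Z$ be point-to-set mappings. The composite
mapping $\mathbf{C} = \mathbf{B}\mathbf{A}$ is defined as the point-to-set mapping $\mathbf{C} : X \to Z$ with

\[
\mathbf{C}(\mathbf{x}) = \bigcup_{\mathbf{y} \in \mathbf{A}(\mathbf{x})} \mathbf{B}(\mathbf{y}).
\]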


This  definition  is  illustrated  in  Fig.  7.7. 

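Proposition. Let $\mathbf{A} : X \to Y$ and $\mathbf{B} : Y \to Z$ be point-to-set mappings. Suppose $\mathbf{A}$ is closed
at $\mathbf{x}$ and $\mathbf{B}$ is closed on $\mathbf{A}(\mathbf{x})$. Suppose also that if $\mathbf{x}_k \to \mathbf{x}$ and $\mathbf{y}_k \in \mathbf{A}(\mathbf{x}_k)$, there is a $\mathbf{y}$ such
that, for some subsequence $\{\mathbf{y}_{k_i}\}$, $\mathbf{y}_{k_i} \to \mathbf{y}$. Then the composite mapping $\mathbf{C} = \mathbf{B}\mathbf{A}$ is closed
at $\mathbf{x}$.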


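Proof. Let $\mathbf{x}_k \to \mathbf{x}$ and $\mathbf{z}_k \to \mathbf{z}$ with $\mathbf{z}_k \in \mathbf{C}(\mathbf{x}_k)$. It must be shown that $\mathbf{z} \in \mathbf{C}(\mathbf{x})$.

Select $\mathbf{y}_k \in \mathbf{A}(\mathbf{x}_k)$ such that $\mathbf{z}_k \in \mathbf{B}(\mathbf{y}_k)$ and according to the hypothesis let $\mathbf{y}$ and
$\{\mathbf{y}_{k_i}\}$ be such that $\mathbf{y}_{k_i} \to \mathbf{y}$. Since $\mathbf{A}$ is closed at $\mathbf{x}$ it follows that $\mathbf{y} \in \mathbf{A}(\mathbf{x})$.

Likewise, since $\mathbf{y}_{k_i} \to \mathbf{y}$, $\mathbf{z}_{k_i} \to \mathbf{z}$ and $\mathbf{B}$ is closed at $\mathbf{y}$, it follows that
$\mathbf{z} \in \mathbf{B}(\mathbf{y}) \subset \mathbf{B}\mathbf{A}(\mathbf{x}) = \mathbf{C}(\mathbf{x})$. ∎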

Two  important  corollaries  follow  immediately. 

Corollary 1. Let $\mathbf{A} : X \to Y$ and $\mathbf{B} : Y \to Z$ be point-to-set mappings. If $\mathbf{A}$ is closed at $\mathbf{x}$, $\mathbf{B}$
is closed on $\mathbf{A}(\mathbf{x})$ and $Y$ is compact, then the composite map $\mathbf{C} = \mathbf{B}\mathbf{A}$ is closed at $\mathbf{x}$.

Corollary 2. Let $\mathbf{A} : X \to Y$ be a point-to-point mapping and $\mathbf{B} : Y \to Z$ a point-to-set
mapping. If $\mathbf{A}$ is continuous at $\mathbf{x}$ and $\mathbf{B}$ is closed at $\mathbf{A}(\mathbf{x})$, then the composite mapping
$\mathbf{C} = \mathbf{B}\mathbf{A}$ is closed at $\mathbf{x}$.


Fig. 7.7 Composition of mappings


Global  Convergence  Theorem 

The Global Convergence Theorem is used to establish convergence for the following
general situation. There is a solution set $T$. Points are generated according to
the algorithm $\mathbf{x}_{k+1} \in \mathbf{A}(\mathbf{x}_k)$, and each new point always strictly decreases a descent
function $Z$ unless the solution set $T$ is reached. For example, in nonlinear programming,
the solution set may be the set of minimum points (perhaps only one point),
and the descent function may be the objective function itself. A suitable algorithm
is found that generates points such that each new point strictly reduces the value of
the objective. Then, under appropriate conditions, it follows that the sequence converges
to the solution set. The Global Convergence Theorem establishes technical
conditions for which convergence is guaranteed.

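Global Convergence Theorem. Let $\mathbf{A}$ be an algorithm on $X$, and suppose that, given $\mathbf{x}_0$ the
sequence $\{\mathbf{x}_k\}_{k=0}^{\infty}$ is generated satisfying

\[
\mathbf{x}_{k+1} \in \mathbf{A}(\mathbf{x}_k).
\]

Let a solution set $T \subset X$ be given, and suppose

i) all points $\mathbf{x}_k$ are contained in a compact set $S \subset X$

ii) there is a continuous function $Z$ on $X$ such that

(a) if $\mathbf{x} \notin T$, then $Z(\mathbf{y}) < Z(\mathbf{x})$ for all $\mathbf{y} \in \mathbf{A}(\mathbf{x})$

(b) if $\mathbf{x} \in T$, then $Z(\mathbf{y}) \le Z(\mathbf{x})$ for all $\mathbf{y} \in \mathbf{A}(\mathbf{x})$

iii) the mapping $\mathbf{A}$ is closed at points outside $T$.

Then the limit of any convergent subsequence of $\{\mathbf{x}_k\}$ is a solution.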

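Proof. Suppose the convergent subsequence $\{\mathbf{x}_k\}$, $k \in \mathcal{K}$ converges to the limit $\mathbf{x}$.
Since $Z$ is continuous, it follows that for $k \in \mathcal{K}$, $Z(\mathbf{x}_k) \to Z(\mathbf{x})$. This means that $Z$ is
convergent with respect to the subsequence, and we shall show that it is convergent
with respect to the entire sequence. By the monotonicity of $Z$ on the sequence $\{\mathbf{x}_k\}$
we have $Z(\mathbf{x}_k) - Z(\mathbf{x}) \ge 0$ for all $k$. By the convergence of $Z$ on the subsequence,
there is, for a given $\varepsilon > 0$, a $K \in \mathcal{K}$ such that $Z(\mathbf{x}_k) - Z(\mathbf{x}) < \varepsilon$ for all $k > K$, $k \in \mathcal{K}$.

Thus for all $k > K$, using $Z(\mathbf{x}_k) - Z(\mathbf{x}_K) \le 0$,

\[
Z(\mathbf{x}_k) - Z(\mathbf{x}) = Z(\mathbf{x}_k) - Z(\mathbf{x}_K) + Z(\mathbf{x}_K) - Z(\mathbf{x}) < \varepsilon,
\]

which shows that $Z(\mathbf{x}_k) \to Z(\mathbf{x})$.

To complete the proof it is only necessary to show that $\mathbf{x}$ is a solution. Suppose
$\mathbf{x}$ is not a solution. Consider the subsequence $\{\mathbf{x}_{k+1}\}_{\mathcal{K}}$. Since all members of
this sequence are contained in a compact set, there is a $\bar{\mathcal{K}} \subset \mathcal{K}$ such that $\{\mathbf{x}_{k+1}\}_{\bar{\mathcal{K}}}$
converges to some limit $\bar{\mathbf{x}}$. We thus have $\mathbf{x}_k \to \mathbf{x}$, $k \in \bar{\mathcal{K}}$, and $\mathbf{x}_{k+1} \in \mathbf{A}(\mathbf{x}_k)$ with
$\mathbf{x}_{k+1} \to \bar{\mathbf{x}}$, $k \in \bar{\mathcal{K}}$. Thus since $\mathbf{A}$ is closed at $\mathbf{x}$ it follows that $\bar{\mathbf{x}} \in \mathbf{A}(\mathbf{x})$. But from
above, $Z(\bar{\mathbf{x}}) = Z(\mathbf{x})$ which contradicts the fact that $Z$ is a descent function. ∎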

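Corollary. If under the conditions of the Global Convergence Theorem $T$ consists of a
single point $\bar{\mathbf{x}}$, then the sequence $\{\mathbf{x}_k\}$ converges to $\bar{\mathbf{x}}$.

Proof. Suppose to the contrary that there is a subsequence $\{\mathbf{x}_k\}_{\mathcal{K}}$ and an $\varepsilon > 0$ such
that $|\mathbf{x}_k - \bar{\mathbf{x}}| > \varepsilon$ for all $k \in \mathcal{K}$. By compactness there must be $\mathcal{K}' \subset \mathcal{K}$ such that
$\{\mathbf{x}_k\}_{\mathcal{K}'}$ converges, say to $\mathbf{x}'$. Clearly, $|\mathbf{x}' - \bar{\mathbf{x}}| \ge \varepsilon$, but by the Global Convergence
Theorem $\mathbf{x}' \in T$, which is a contradiction. ∎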

In later chapters the Global Convergence Theorem is used to establish the convergence
of several standard algorithms. Here we consider some simple examples
designed  to  illustrate  the  roles  of  the  various  conditions  of  the  theorem. 

Example 4. In many respects condition (iii) of the theorem, the closedness of $\mathbf{A}$ outside
the solution set, is the most important condition. The failure of many popular
algorithms can be traced to nonsatisfaction of this condition. On the real line consider
the point-to-point algorithm

\[
A(x) =
\begin{cases}
\tfrac{1}{2}(x - 1) + 1, & x > 1 \\
\tfrac{1}{2}x, & x \le 1
\end{cases}
\]

and the solution set $T = \{0\}$. It is easily verified that a descent function for this
solution set and this algorithm is $Z(x) = |x|$. However, starting from $x > 1$, the
algorithm generates a sequence converging to $x = 1$ which is not a solution. The
difficulty is that $A$ is not closed at $x = 1$.
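
The behavior in this example is easy to reproduce. The following minimal sketch uses exact rational arithmetic (an implementation choice made here so that the iterates never round to exactly 1). The starting point $x_0 = 3$ and the helper name `iterate` are illustrative assumptions.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def A(x):
    """The point-to-point algorithm of Example 4."""
    if x > 1:
        return HALF * (x - 1) + 1
    return HALF * x


def Z(x):
    """Descent function for the solution set {0}."""
    return abs(x)


def iterate(x0, steps):
    """Return [x_0, x_1, ..., x_steps] with x_{k+1} = A(x_k)."""
    xs = [Fraction(x0)]
    for _ in range(steps):
        xs.append(A(xs[-1]))
    return xs


if __name__ == "__main__":
    xs = iterate(3, 10)
    print([float(x) for x in xs])  # decreases toward 1, never reaching it
    print(float(A(Fraction(1))))   # A(1) = 1/2, not the limit 1 of A(x_k)
```

Starting from $x_0 = 3$ the iterates are $x_k = 1 + 2/2^k$, so $Z$ strictly decreases at every step and yet the limit is 1. The lack of closedness shows up as $A(x_k) \to 1$ while $A(1) = 1/2$.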

Example 5. On the real line $X$ consider the solution set to be empty, the descent
function $Z(x) = e^{-x}$, and the algorithm $A(x) = x + 1$. All conditions of the convergence
theorem except (i) hold. The sequence generated from any starting condition
diverges to infinity. This is not strictly a violation of the conclusion of the theorem
but simply an example illustrating that if no compactness assumption is introduced,
the generated sequence may have no convergent subsequence.

Example 6. Consider the point-to-set algorithm $A$ defined by the graph in Fig. 7.8
and given explicitly on $X = [0, 1]$ by

\[
A(x) =
\begin{cases}
[0, x), & 1 \ge x > 0 \\
0, & x = 0,
\end{cases}
\]

where $[0, x)$ denotes a half-open interval (see Appendix A). Letting $T = \{0\}$, the
function $Z(x) = x$ serves as a descent function, because for $x \ne 0$ all points in $A(x)$
are less than $x$.


Fig. 7.8 Graph for Example 6


The sequence defined by

\[
x_0 = 1, \qquad x_{k+1} = x_k - \frac{1}{2^{k+2}}
\]

satisfies $x_{k+1} \in A(x_k)$ but it can easily be seen that $x_k \to \tfrac{1}{2} \notin T$. The difficulty here,
of course, is that the algorithm $A$ is not closed outside the solution set.
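
The claim about the limit can be verified with exact arithmetic. The sketch below is a small check written for this purpose, with hypothetical helper names `in_A` and `sequence`. It confirms that every step stays inside $A(x_k) = [0, x_k)$ and that the iterates have the closed form $x_k = \tfrac{1}{2} + 2^{-(k+1)}$.

```python
from fractions import Fraction


def in_A(y, x):
    """Membership test y in A(x) for the mapping of Example 6 on [0, 1]."""
    if x == 0:
        return y == 0
    return 0 <= y < x


def sequence(steps):
    """Return [x_0, ..., x_steps] with x_0 = 1 and x_{k+1} = x_k - 1/2^(k+2)."""
    xs = [Fraction(1)]
    for k in range(steps):
        xs.append(xs[-1] - Fraction(1, 2 ** (k + 2)))
    return xs


if __name__ == "__main__":
    xs = sequence(12)
    print([float(x) for x in xs])
```

Since the subtracted terms sum to $\sum_{k \ge 0} 2^{-(k+2)} = \tfrac{1}{2}$, the sequence decreases strictly but stalls at $\tfrac{1}{2}$ rather than reaching the solution set.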


*  Spacer  Steps 

In  some  of  the  more  complex  algorithms  presented  in  later  chapters,  the  rule  used  to 
determine  a  succeeding  point  in  an  iteration  may  depend  on  several  previous  points 
rather  than  just  the  current  point,  or  it  may  depend  on  the  iteration  index  k.  Such 
features  are  generally  introduced  in  order  to  obtain  a  rapid  rate  of  convergence  but 
they can grossly complicate the analysis of global convergence.

If in such a complex sequence of steps there is inserted, perhaps irregularly but
infinitely often, a step of an algorithm such as steepest descent that is known to
converge, then it is not difficult to ensure that the entire complex process converges.
The  step  which  is  repeated  infinitely  often  and  guarantees  convergence  is  called  a 
spacer  step,  since  it  separates  disjoint  portions  of  the  complex  sequence.  Essentially 
the  only  requirement  imposed  on  the  other  steps  of  the  process  is  that  they  do  not 
increase  the  value  of  the  descent  function. 

This  type  of  situation  can  be  analyzed  easily  from  the  following  viewpoint. 
Suppose $\mathbf{B}$ is an algorithm which together with the descent function $Z$ and solution
set $T$, satisfies all the requirements of the Global Convergence Theorem. Define
the algorithm $\mathbf{C}$ by $\mathbf{C}(\mathbf{x}) = \{\mathbf{y} : Z(\mathbf{y}) \le Z(\mathbf{x})\}$. In other words, $\mathbf{C}$ applied to $\mathbf{x}$ can
give any point so long as it does not increase the value of $Z$. It is easy to verify that
C  is  closed.  We  imagine  that  B  represents  the  spacer  step  and  the  complex  process 
between  spacer  steps  is  just  some  realization  of  C.  Thus  the  overall  process  amounts 
merely  to  repeated  applications  of  the  composite  algorithm  CB.  With  this  viewpoint 
we  may  state  the  Spacer  Step  Theorem. 

Spacer Step Theorem. Suppose $\mathbf{B}$ is an algorithm on $X$ which is closed outside the solution
set $T$. Let $Z$ be a descent function corresponding to $\mathbf{B}$ and $T$.

Suppose that the sequence $\{\mathbf{x}_k\}_{k=0}^{\infty}$ is generated satisfying

\[
\mathbf{x}_{k+1} \in \mathbf{B}(\mathbf{x}_k)
\]

for $k$ in an infinite index set $\mathcal{K}$, and that

\[
Z(\mathbf{x}_{k+1}) \le Z(\mathbf{x}_k)
\]

for all $k$. Suppose also that the set $S = \{\mathbf{x} : Z(\mathbf{x}) \le Z(\mathbf{x}_0)\}$ is compact. Then the limit of any
convergent subsequence of $\{\mathbf{x}_k\}_{\mathcal{K}}$ is a solution.

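Proof. We first define for any $\mathbf{x} \in X$, $\bar{\mathbf{B}}(\mathbf{x}) = S \cap \mathbf{B}(\mathbf{x})$ and then observe that $\mathbf{A} = \mathbf{C}\bar{\mathbf{B}}$
is closed outside the solution set by Corollary 1. The Global Convergence Theorem
can then be applied to $\mathbf{A}$. Since $S$ is compact, there is a subsequence of $\{\mathbf{x}_k\}_{k \in \mathcal{K}}$
converging to a limit $\mathbf{x}$. In view of the above we conclude that $\mathbf{x} \in T$. ∎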
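
The spacer-step idea can be illustrated with a toy process. Everything in the sketch below is an assumption made for illustration: the quadratic objective with diagonal `Q`, the random "other" step (a stand-in for whatever complex rule runs between spacer steps), and the schedule of one spacer step every third iteration. The spacer step is steepest descent with exact line search, and the other steps are constrained only to not increase $Z$.

```python
import random

Q = (1.0, 10.0)  # diagonal of a positive definite quadratic form (illustrative)


def Z(x):
    """Descent function: the objective 0.5 * x^T Q x itself."""
    return 0.5 * (Q[0] * x[0] ** 2 + Q[1] * x[1] ** 2)


def spacer_step(x):
    """Steepest descent with exact line search for the quadratic above."""
    g = (Q[0] * x[0], Q[1] * x[1])
    gg = g[0] ** 2 + g[1] ** 2
    if gg == 0.0:
        return x  # already at the minimizer
    alpha = gg / (Q[0] * g[0] ** 2 + Q[1] * g[1] ** 2)
    return (x[0] - alpha * g[0], x[1] - alpha * g[1])


def other_step(x, rng):
    """Stand-in for a complex step: a random trial, kept only if Z does not increase."""
    trial = (x[0] + rng.uniform(-0.1, 0.1), x[1] + rng.uniform(-0.1, 0.1))
    return trial if Z(trial) <= Z(x) else x


def run(x0, iterations=300, spacer_every=3, seed=0):
    """Interleave spacer steps with arbitrary non-increasing steps; return (x, Z history)."""
    rng = random.Random(seed)
    x, values = x0, [Z(x0)]
    for k in range(iterations):
        x = spacer_step(x) if k % spacer_every == 0 else other_step(x, rng)
        values.append(Z(x))
    return x, values


if __name__ == "__main__":
    x, values = run((5.0, 1.0))
    print("final point:", x, "final Z:", values[-1])
```

The non-spacer steps contribute nothing to the convergence argument beyond not increasing $Z$. The repeated spacer steps alone drive the descent function to its minimum.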


7.8  Speed  of  Convergence 

The  study  of  speed  of  convergence  is  an  important  but  sometimes  complex  subject. 
Nevertheless,  there  is  a  rich  and  yet  elementary  theory  of  convergence  rates  that 
enables  one  to  predict  with  confidence  the  relative  effectiveness  of  a  wide  class  of 
algorithms.  In  this  section  we  introduce  various  concepts  designed  to  measure  speed 
of  convergence,  and  prepare  for  a  study  of  this  most  important  aspect  of  nonlinear 
programming.


Order of Convergence


Consider a sequence of real numbers $\{r_k\}_{k=0}^{\infty}$ converging to the limit $r^*$. We define
several notions related to the speed of convergence of such a sequence.

Definition. Let the sequence $\{r_k\}$ converge to $r^*$. The order of convergence of $\{r_k\}$ is defined
as the supremum of the nonnegative numbers $p$ satisfying

\[
0 \le \limsup_{k \to \infty} \frac{|r_{k+1} - r^*|}{|r_k - r^*|^p} < \infty.
\]


To ensure that the definition is applicable to any sequence, it is stated in terms
of limit superior rather than just limit and 0/0 (which occurs if $r_k = r^*$ for all $k$)
is  regarded  as  finite.  But  these  technicalities  are  rarely  necessary  in  actual  analysis, 
since  the  sequences  generated  by  algorithms  are  generally  quite  well  behaved. 

It  should  be  noted  that  the  order  of  convergence,  as  with  all  other  notions  related 
to  speed  of  convergence  that  are  introduced,  is  determined  only  by  the  properties 
of the sequence that hold as $k \to \infty$. Somewhat loosely but picturesquely, we are
therefore  led  to  refer  to  the  tail  of  a  sequence — that  part  of  the  sequence  that  is 
arbitrarily  far  out.  In  this  language  we  might  say  that  the  order  of  convergence  is  a 
measure  of  how  good  the  worst  part  of  the  tail  is.  Larger  values  of  the  order  p  imply, 
in  a  sense,  faster  convergence,  since  the  distance  from  the  limit  r*  is  reduced,  at  least 
in the tail, by the $p$th power in a single step. Indeed, if the sequence has order $p$ and
(as is the usual case) the limit

\[
\beta = \lim_{k \to \infty} \frac{|r_{k+1} - r^*|}{|r_k - r^*|^p}
\]

exists, then asymptotically we have

\[
|r_{k+1} - r^*| = \beta\, |r_k - r^*|^p.
\]


Example 1. The sequence with $r_k = a^k$ where $0 < a < 1$ converges to zero with
order unity, since $r_{k+1}/r_k = a$.

Example 2. The sequence with $r_k = a^{(2^k)}$ for $0 < a < 1$ converges to zero with order
two, since $r_{k+1}/r_k^2 = 1$.
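
The two examples can be checked directly by computing the ratios that appear in the definition. This is a minimal sketch: the value $a = 0.5$ and the helper name `ratios` are illustrative choices, and the power-of-two value of $a$ keeps the floating-point ratios exact.

```python
def ratios(r, p):
    """Return the step-wise ratios r[k+1] / r[k]**p for a sequence converging to zero."""
    return [r[k + 1] / r[k] ** p for k in range(len(r) - 1)]


a = 0.5  # illustrative value with 0 < a < 1
example_1 = [a ** k for k in range(1, 9)]      # r_k = a^k
example_2 = [a ** (2 ** k) for k in range(6)]  # r_k = a^(2^k)

if __name__ == "__main__":
    print("Example 1, p = 1:", ratios(example_1, 1))  # constant, equal to a
    print("Example 1, p = 2:", ratios(example_1, 2))  # grows without bound
    print("Example 2, p = 2:", ratios(example_2, 2))  # constant, equal to 1
```

For the geometric sequence the ratio with $p = 1$ stays at $a$ while the ratio with $p = 2$ blows up, which is why its order is one. For the second sequence the ratio with $p = 2$ stays at 1.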


Linear  Convergence 


Most  algorithms  discussed  in  this  book  have  an  order  of  convergence  equal  to  unity. 
It  is  therefore  appropriate  to  consider  this  class  in  greater  detail  and  distinguish 
certain  cases  within  it. 


Definition. If the sequence $\{r_k\}$ converges to $r^*$ in such a way that

\[
\lim_{k \to \infty} \frac{|r_{k+1} - r^*|}{|r_k - r^*|} = \beta < 1,
\]

the sequence is said to converge linearly to $r^*$ with convergence ratio (or rate) $\beta$.

Linear convergence is, for our purposes, without doubt the most important type
of convergence behavior. A linearly convergent sequence, with convergence ratio $\beta$,
can be said to have a tail that converges at least as fast as the geometric sequence
$c\beta^k$ for some constant $c$. Thus linear convergence is sometimes referred to as geometric
convergence, although in this book we reserve that phrase for the case when
a sequence is exactly geometric.

As  a  rule,  when  comparing  the  relative  effectiveness  of  two  competing  algorithms 
both  of  which  produce  linearly  convergent  sequences,  the  comparison  is  based  on 
their  corresponding  convergence  ratios — the  smaller  the  ratio  the  faster  the  rate. 
The ultimate case where $\beta = 0$ is referred to as superlinear convergence. We note
immediately  that  convergence  of  any  order  greater  than  unity  is  superlinear,  but  it  is 
also  possible  for  superlinear  convergence  to  correspond  to  unity  order. 

Example 3. The sequence $r_k = (1/k)^k$ is of order unity, since $r_{k+1}/r_k^p \to \infty$ for $p > 1$.
However, $r_{k+1}/r_k \to 0$ as $k \to \infty$ and hence this is superlinear convergence.
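
Both limits in this example can be observed numerically. Because $(1/k)^k$ underflows quickly in floating point, the sketch below works with logarithms of the ratios. That choice, the helper names, and the value $p = 1.1$ are illustrative assumptions.

```python
import math


def log_r(k):
    """Natural log of r_k = (1/k)^k, for k >= 1."""
    return -k * math.log(k)


def log_ratio(k, p=1.0):
    """Natural log of r_{k+1} / r_k^p."""
    return log_r(k + 1) - p * log_r(k)


if __name__ == "__main__":
    for k in (10, 1000, 10 ** 6):
        print(k, log_ratio(k), log_ratio(k, p=1.1))
```

With $p = 1$ the log-ratio tends to $-\infty$ (the ratio tends to 0, so convergence is superlinear). With any $p > 1$, here $p = 1.1$, the log-ratio eventually becomes positive and grows without bound, so the order cannot exceed one.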


Arithmetic  Convergence 

Linear  convergence  is  also  called  geometric  convergence.  There  is  another  (slower) 
type  of  convergence: 

Definition. If the sequence $\{r_k\}$ converges to $r^*$ in such a way that

$$|r_k - r^*| \le \frac{C}{k^p}, \quad k \ge 1, \quad 0 < p < \infty,$$

where $C$ is a fixed positive number, the sequence is said to converge arithmetically to $r^*$
with order $p$.

When $p = 1$, it is referred to as arithmetic convergence. The greater the value of $p$, the faster
the convergence.

Example 4. The sequence $r_k = 1/k$ converges to zero arithmetically. The convergence
is of order one but it is not linear, since $\lim_{k \to \infty} (r_{k+1}/r_k) = 1$, that is, $\beta$ is not
strictly less than one.
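
The behavior described in Examples 3 and 4 can be checked numerically. The following is a minimal Python sketch, not part of the original development: the helper name `stepwise_ratios` and the three sample sequences are illustrative choices. It evaluates the step-wise ratio $|r_{k+1} - r^*| / |r_k - r^*|$ near the tail of each sequence.

```python
def stepwise_ratios(r, r_star=0.0):
    """Return |r[k+1] - r*| / |r[k] - r*| for consecutive terms of the list r."""
    errors = [abs(value - r_star) for value in r]
    return [errors[k + 1] / errors[k] for k in range(len(errors) - 1)]


ks = range(1, 30)
geometric = [0.5 ** k for k in ks]            # linear convergence, ratio 0.5
superlinear = [(1.0 / k) ** k for k in ks]    # Example 3: ratios tend to 0
arithmetic = [1.0 / k for k in ks]            # Example 4: ratios tend to 1

for name, seq in [("0.5^k", geometric), ("(1/k)^k", superlinear), ("1/k", arithmetic)]:
    # Print the last available step-wise ratio of each sequence.
    print(name, stepwise_ratios(seq)[-1])
```

Under these assumptions the geometric sequence shows a constant ratio of 0.5, the ratio for $(1/k)^k$ shrinks toward zero, and the ratio for $1/k$ approaches one from below.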


*  Average  Rates 

All the definitions given above can be referred to as step-wise concepts of convergence,
since they define bounds on the progress made by going a single step: from k
to  k  +  1.  Another  approach  is  to  define  concepts  related  to  the  average  progress  per 
step  over  a  large  number  of  steps.  We  briefly  illustrate  how  this  can  be  done. 



Definition. Let the sequence $\{r_k\}$ converge to $r^*$. The average order of convergence is the
infimum of the numbers $p > 1$ such that

$$\lim_{k \to \infty} |r_k - r^*|^{1/p^k} = 1.$$

The order is infinity if the equality holds for no $p > 1$.

Example 5. For the sequence $r_k = a^{(2^k)}$, $0 < a < 1$, given in Example 2, we have

$$|r_k|^{1/2^k} = a,$$

while

$$|r_k|^{1/p^k} = a^{(2/p)^k} \to 1$$

for $p > 2$. Thus the average order is two.

Example 6. For $r_k = a^k$ with $0 < a < 1$ we have

$$(r_k)^{1/p^k} = a^{k(1/p)^k} \to 1$$

for any $p > 1$. Thus the average order is unity.

As before, the most important case is that of unity order, and in this case we
define the average convergence ratio as $\lim_{k \to \infty} |r_k - r^*|^{1/k}$. Thus for the geometric
sequence $r_k = ca^k$, $0 < a < 1$, the average convergence ratio is $a$. Paralleling
the earlier definitions, the reader can then in a similar manner define corresponding
notions of average linear and average superlinear convergence.
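
As a rough numerical check of these average concepts, the sketch below evaluates the two roots that appear in the definitions. It is only an illustration: the function names are hypothetical, and working with $\log |r_k - r^*|$ instead of the error itself is a design choice made here to avoid floating-point underflow for sequences such as $a^{(2^k)}$.

```python
import math


def average_ratio(log_error, k):
    """k-th root of the error at step k, computed from log|r_k - r*|."""
    return math.exp(log_error / k)


def average_order_root(log_error, k, p):
    """|r_k - r*|**(1 / p**k), computed from log|r_k - r*|."""
    return math.exp(log_error / p ** k)


a, c, k = 0.5, 3.0, 400
log_geometric = math.log(c) + k * math.log(a)   # r_k = c * a**k
print(average_ratio(log_geometric, k))          # approaches a as k grows

k = 40
log_example5 = (2 ** k) * math.log(a)           # Example 5: r_k = a**(2**k)
for p in (1.5, 2, 3):
    # The root stays away from 1 for p <= 2 and tends to 1 for p > 2.
    print(p, average_order_root(log_example5, k, p))
```

For the geometric sequence the printed value is close to $a$, and for the sequence of Example 5 the root is near zero for $p = 1.5$, equal to $a$ for $p = 2$, and near one for $p = 3$, in line with an average order of two.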

Although  the  above  array  of  definitions  can  be  further  embellished  and  expanded, 
it  is  quite  adequate  for  our  purposes.  For  the  most  part  we  work  with  the  step-wise 
definitions,  since  in  analyzing  iterative  algorithms  it  is  natural  to  compare  one  step 
with  the  next.  In  most  situations,  moreover,  when  the  sequences  are  well  behaved 
and  the  limits  exist  in  the  definitions,  then  the  step-wise  and  average  concepts  of 
convergence  rates  coincide. 


*  Convergence  of  Vectors 

Suppose $\{\mathbf{x}_k\}_{k=0}^{\infty}$ is a sequence of vectors in $E^n$ converging to a vector $\mathbf{x}^*$. The convergence
properties of such a sequence are defined with respect to some particular
function that converts the sequence of vectors into a sequence of numbers. Thus,
if $f$ is a given continuous function on $E^n$, the convergence properties of $\{\mathbf{x}_k\}$ can
be defined with respect to $f$ by analyzing the convergence of $f(\mathbf{x}_k)$ to $f(\mathbf{x}^*)$. The
function $f$ used in this way to measure convergence is called the error function.

In  optimization  theory  it  is  common  to  choose  the  error  function  by  which  to 
measure  convergence  as  the  same  function  that  defines  the  objective  function  of  the 
original  optimization  problem.  This  means  we  measure  convergence  by  how  fast  the 
objective converges to its minimum. Alternatively, we sometimes use the function
$|\mathbf{x} - \mathbf{x}^*|^2$ and thereby measure convergence by how fast the (squared) distance from
the solution point decreases to zero.

Generally,  the  order  of  convergence  of  a  sequence  is  insensitive  to  the  particular 
error  function  used;  but  for  step-wise  linear  convergence  the  associated  convergence 
ratio  is  not.  Nevertheless,  the  average  convergence  ratio  is  not  too  sensitive,  as  the 
following  proposition  demonstrates,  and  hence  the  particular  error  function  used  to 
measure  convergence  is  not  really  very  important. 

Proposition. Let $f$ and $g$ be two error functions satisfying $f(\mathbf{x}^*) = g(\mathbf{x}^*) = 0$ and, for all $\mathbf{x}$,
a relation of the form

$$0 \le a_1 g(\mathbf{x}) \le f(\mathbf{x}) \le a_2 g(\mathbf{x})$$

for some fixed $a_1 > 0$, $a_2 > 0$. If the sequence $\{\mathbf{x}_k\}_{k=0}^{\infty}$ converges to $\mathbf{x}^*$ linearly with average
ratio $\beta$ with respect to one of these functions, it also does so with respect to the other.

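Proof. The statement is easily seen to be symmetric in $f$ and $g$. Thus we assume
$\{\mathbf{x}_k\}$ is linearly convergent with average convergence ratio $\beta$ with respect to $f$, and
will prove that the same is true with respect to $g$. We have

$$\beta = \lim_{k \to \infty} f(\mathbf{x}_k)^{1/k} \le \lim_{k \to \infty} a_2^{1/k} g(\mathbf{x}_k)^{1/k} = \lim_{k \to \infty} g(\mathbf{x}_k)^{1/k}$$

and

$$\beta = \lim_{k \to \infty} f(\mathbf{x}_k)^{1/k} \ge \lim_{k \to \infty} a_1^{1/k} g(\mathbf{x}_k)^{1/k} = \lim_{k \to \infty} g(\mathbf{x}_k)^{1/k}.$$

Thus

$$\beta = \lim_{k \to \infty} g(\mathbf{x}_k)^{1/k}. \qquad \blacksquare$$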

As an example of an application of the above proposition, consider the case
where $g(\mathbf{x}) = |\mathbf{x} - \mathbf{x}^*|^2$ and $f(\mathbf{x}) = (\mathbf{x} - \mathbf{x}^*)^T \mathbf{Q} (\mathbf{x} - \mathbf{x}^*)$, where $\mathbf{Q}$ is a positive definite
symmetric matrix. Then $a_1$ and $a_2$ correspond, respectively, to the smallest and
largest eigenvalues of $\mathbf{Q}$. Thus average linear convergence is identical with respect
to any error function constructed from a positive definite quadratic form.
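
The claim can be illustrated with a small numerical experiment. The sketch below is an assumption-laden example rather than part of the text: it takes a hypothetical error sequence $\mathbf{x}_k - \mathbf{x}^* = \beta^k \mathbf{d}$ in $E^2$ with a fixed direction $\mathbf{d}$, and compares the $k$-th roots of the two error functions for one particular positive definite $\mathbf{Q}$.

```python
def quad_form(Q, v):
    """Return v^T Q v for a 2x2 matrix Q given as nested lists."""
    return sum(v[i] * Q[i][j] * v[j] for i in range(2) for j in range(2))


def kth_roots(beta, k, Q, direction=(1.0, 2.0)):
    """k-th roots of g = |e|^2 and f = e^T Q e, where e = beta**k * direction."""
    e = [beta ** k * d for d in direction]
    g = e[0] ** 2 + e[1] ** 2
    f = quad_form(Q, e)
    return g ** (1.0 / k), f ** (1.0 / k)


Q = [[2.0, 1.0], [1.0, 3.0]]   # symmetric, positive definite (illustrative choice)
for k in (10, 100, 2000):
    # Both roots drift toward beta**2 = 0.81 as k grows.
    print(k, kth_roots(0.9, k, Q))
```

Since both error functions are quadratic in the error, both $k$-th roots approach $\beta^2$; the constants contributed by $\mathbf{Q}$ and by the direction only enter through factors of the form $(\text{const})^{1/k}$, which tend to one.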


Complexity 

Complexity  theory  as  outlined  in  Sect.  5.1  is  an  important  aspect  of  convergence 
theory.  This  theory  can  be  used  in  conjunction  with  the  theory  of  local  convergence. 
If  an  algorithm  converges  according  to  any  order  greater  than  zero,  then  for  a  fixed 
problem,  the  sequence  generated  by  the  algorithm  will  converge  in  a  time  that  is  a 
function  of  the  convergence  order  (and  rate,  if  convergence  is  linear).  For  example, 
if the order is one with rate $0 < c < 1$ and the process begins with an error of $R$,
a final error of $r$ can be achieved by a number of steps $n$ satisfying $c^n R \le r$. Thus
it requires approximately $n = \log(R/r)/\log(1/c)$ steps. In this form the number of
steps is not affected by the size of the problem. However, problem size enters in
two possible ways. First, the rate $c$ may depend on the size, say going toward 1 as
the size increases so that the speed is slower for large problems. The second way
that size may enter, and this is the more important way, is that the time to execute
a single step almost always increases with problem size. For instance if, for a
problem seeking an optimal vector of dimension $m$, each step requires a Gaussian
elimination inversion of an $m \times m$ matrix, the solution time will increase by a factor
proportional to $m^3$. Overall the algorithm is therefore a polynomial time algorithm.
Essentially all algorithms in this book employ steps, such as matrix multiplications
or inversion or other algebraic operations, which are polynomial-time in character.
Convergence analysis, therefore, focuses on whether an algorithm is globally convergent,
on its local convergence properties, and also on the order of the algebraic
operations required to execute the steps required. The last of these is usually easily
deduced by listing the number and size of the required vector and matrix operations.
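
The step count for the linear case can be evaluated directly. The following minimal sketch (the function name is a hypothetical choice) returns the smallest integer $n$ with $c^n R \le r$, using the estimate $n = \log(R/r)/\log(1/c)$ rounded up.

```python
import math


def steps_for_linear_convergence(R, r, c):
    """Smallest integer n with c**n * R <= r, for a linear rate 0 < c < 1."""
    if not (0.0 < c < 1.0) or R <= 0.0 or r <= 0.0:
        raise ValueError("need 0 < c < 1 and positive errors R, r")
    if r >= R:
        return 0  # the initial error is already small enough
    # Round the real-valued estimate log(R/r) / log(1/c) up to an integer.
    return math.ceil(math.log(R / r) / math.log(1.0 / c))


for c in (0.5, 0.9, 0.99):
    print(c, steps_for_linear_convergence(1.0, 1e-6, c))
```

The loop shows how the number of steps grows as the rate $c$ moves toward one, which is the first way in which problem size can enter the overall running time.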


7.9  Summary 

There are two different but complementary ways to characterize the solution to
unconstrained optimization problems. In the local approach, one examines the relation
of a given point to its neighbors. This leads to the conclusion that, at an
unconstrained relative minimum point of a smooth function, the gradient of the
function vanishes and the Hessian is positive semidefinite; and conversely, if at a
point the gradient vanishes and the Hessian is positive definite, that point is a relative
minimum point. This characterization has a natural extension to the global
approach where convexity ensures that if the gradient vanishes at a point, that point
is a global minimum point.

In considering iterative algorithms for finding either local or global minimum
points, there are two distinct issues: global convergence properties and local convergence
properties. The first is concerned with whether starting at an arbitrary
point the sequence generated will converge to a solution. This is ensured if the
algorithm is closed, has a descent function, and generates a bounded sequence. It
is also explained that global convergence is guaranteed simply by the inclusion, in
a complex algorithm, of spacer steps. This result is called upon frequently in what
follows. Local convergence properties are a measure of the ultimate speed of convergence
and generally determine the relative advantage of one algorithm to another.


7.10  Exercises 

1. To approximate a function $g$ over the interval $[0, 1]$ by a polynomial $p$ of degree
$n$ (or less), we minimize the criterion

$$f(\mathbf{a}) = \int_0^1 [g(x) - p(x)]^2 \, dx,$$

where $p(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_0$. Find the equations satisfied by the
optimal coefficients $\mathbf{a} = (a_0, a_1, \ldots, a_n)$.

2. In Example 4 of Sect. 7.2 show that if the solution has $x_1 > 0$, $x_1 + x_2 = 1$, then
it is necessary that

$$b_1 - b_2 + (c_1 - c_2) h(x_1) = 0$$
$$b_2 + (c_2 - c_3) h(x_1 + x_2) \le 0.$$

Hint: One way is to reformulate the problem in terms of the variables $x_1$ and
$y = x_1 + x_2$.

3.  (a)  Using  the  first-order  necessary  conditions,  find  a  minimum  point  of  the 
function 


$$f(x, y, z) = 2x^2 + xy + y^2 + yz + z^2 - 6x - 7y - 8z + 9.$$

(b)  Verify  that  the  point  is  a  relative  minimum  point  by  verifying  that  the 
second-order  sufficiency  conditions  hold. 

(c)  Prove  that  the  point  is  a  global  minimum  point. 

4. In this exercise and the next we develop a method for determining whether a
given symmetric matrix is positive definite. Given an $n \times n$ matrix $A$ let $A_k$
denote the principal submatrix made up of the first $k$ rows and columns. Show
(by induction) that if the first $n - 1$ principal submatrices are nonsingular, then
there is a unique lower triangular matrix $L$ with unit diagonal and a unique
upper triangular matrix $U$ such that $A = LU$. (See Appendix C.)

5. A symmetric matrix is positive definite if and only if the determinant of each
of its principal submatrices is positive. Using this fact and the considerations of
Exercise 4, show that an $n \times n$ symmetric matrix $A$ is positive definite if and only
if it has an $LU$ decomposition (without interchange of rows) and the diagonal
elements of $U$ are all positive.

6. Using Exercise 5 show that an $n \times n$ matrix $A$ is symmetric and positive definite
if and only if it can be written as $A = GG^T$ where $G$ is a lower triangular matrix
with positive diagonal elements. This representation is known as the Cholesky
factorization of $A$.

7. Let $f_i$, $i \in I$ be a collection of convex functions defined on a convex set $\Omega$.
Show that the function $f$ defined by $f(\mathbf{x}) = \sup_{i \in I} f_i(\mathbf{x})$ is convex on the region
where it is finite.

8. Let $\gamma$ be a monotone nondecreasing function of a single variable (that is, $\gamma(r) \le \gamma(r')$
for $r' > r$) which is also convex; and let $f$ be a convex function defined
on a convex set $\Omega$. Show that the function $\gamma(f)$ defined by $\gamma(f)(\mathbf{x}) = \gamma[f(\mathbf{x})]$ is
convex on $\Omega$.

9. Let $f$ be twice continuously differentiable on a region $\Omega \subset E^n$. Show that a
sufficient condition for a point $\mathbf{x}^*$ in the interior of $\Omega$ to be a relative minimum
point of $f$ is that $\nabla f(\mathbf{x}^*) = \mathbf{0}$ and that $f$ be locally convex at $\mathbf{x}^*$.

10. Define the point-to-set mapping on $E^n$ by

$$A(\mathbf{x}) = \{\mathbf{y} : \mathbf{y}^T \mathbf{x} \le b\},$$

where $b$ is a fixed constant. Is $A$ closed?

11. Prove the two corollaries in Sect. 7.6 on the closedness of composite mappings.

12. Show that if A is a continuous point-to-point mapping, the Global Convergence
Theorem is valid even without assumption (i). Compare with Example 2,
Sect. 7.7.

13. Let $\{r_k\}_{k=0}^{\infty}$ and $\{c_k\}_{k=0}^{\infty}$ be sequences of real numbers. Suppose $r_k \to 0$ average
linearly and that there are constants $c > 0$ and $C$ such that $c \le c_k \le C$ for all $k$.
Show that $c_k r_k \to 0$ average linearly.

14.  Prove  a  proposition,  similar  to  the  one  in  Sect.  7.8,  showing  that  the  order  of 
convergence  is  insensitive  to  the  error  function. 

15. Show that if $r_k \to r^*$ (step-wise) linearly with convergence ratio $\beta$, then $r_k \to r^*$
(average) linearly with average convergence ratio no greater than $\beta$.

7.1-7.5 For alternative discussions of the material in these sections, see Hadley
[H2], Fiacco and McCormick [F4], Zangwill [Z2] and Luenberger [L8].

7.6 Although the general concepts of this section are well known, the formulation
as zero-order conditions appears to be new.

7.7 The idea of using a descent function (usually the objective itself) in order
to guarantee convergence of minimization algorithms is an old one that
runs through most literature on optimization, and has long been used to
establish global convergence. Formulation of the general Global Convergence
Theorem, which captures the essence of many previously diverse
arguments, and the idea of representing an algorithm as a point-to-set mapping
are both due to Zangwill [Z2]. A version of the Spacer Step Theorem
can be found in Zangwill [Z2] as well.

7.8 Most of the definitions given in this section have been standard for quite
some time. A thorough discussion which contributes substantially to the
unification of these concepts is contained in Ortega and Rheinboldt [O7].


Chapter  8 

Basic  Descent  Methods 


We  turn  now  to  a  description  of  the  basic  techniques  used  for  iteratively  solving 
unconstrained  minimization  problems.  These  techniques  are,  of  course,  important 
for  practical  application  since  they  often  offer  the  simplest,  most  direct  alternatives 
for  obtaining  solutions;  but  perhaps  their  greatest  importance  is  that  they  establish 
certain  reference  plateaus  with  respect  to  difficulty  of  implementation  and  speed 
of  convergence.  Thus  in  later  chapters  as  more  efficient  techniques  and  techniques 
capable  of  handling  constraints  are  developed,  reference  is  continually  made  to  the 
basic  techniques  of  this  chapter  both  for  guidance  and  as  points  of  comparison. 

There  is  a  fundamental  underlying  structure  for  almost  all  the  descent  algorithms 
we  discuss.  One  starts  at  an  initial  point;  determines,  according  to  a  fixed  rule,  a 
direction  of  movement;  and  then  moves  in  that  direction  to  a  (relative)  minimum  of 
the  objective  function  on  that  line.  At  the  new  point  a  new  direction  is  determined 
and  the  process  is  repeated.  The  primary  differences  between  algorithms  (steepest 
descent,  Newton’s  method,  etc.)  rest  with  the  rule  by  which  successive  directions  of 
movement  are  selected.  Once  the  selection  is  made,  all  algorithms  call  for  movement 
to  the  minimum  point  on  the  corresponding  line. 

The  process  of  determining  the  minimum  point  on  a  given  line  (one  variable 
only)  is  called  line  search.  For  general  nonlinear  functions  that  cannot  be  minimized 
analytically,  this  process  actually  is  accomplished  by  searching,  in  an  intelligent 
manner,  along  the  line  for  the  minimum  point.  These  line  search  techniques,  which 
are  really  procedures  for  solving  one-dimensional  minimization  problems,  form  the 
backbone  of  nonlinear  programming  algorithms,  since  higher  dimensional  problems 
are  ultimately  solved  by  executing  a  sequence  of  successive  line  searches.  There  are 
a  number  of  different  approaches  to  this  important  phase  of  minimization  and  the 
first half of this chapter is devoted to their discussion.

The last sections of the chapter are devoted to a description and analysis of
the basic descent algorithms for unconstrained problems: steepest descent, coordinate
descent, and Newton’s method. These algorithms serve as primary models
for the development and analysis of all others discussed in the book.


8.1  Line  Search  Algorithms 

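These algorithms are classified by the order of information about the objective function
$f(x)$ that is evaluated: function values only (0th order), first derivatives (1st order),
or second derivatives (2nd order).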


0th-Order Method: Golden Section Search and Curve Fitting

A  very  popular  method  for  resolving  the  line  search  problem  is  the  Fibonacci  search 
method  described  in  this  section.  The  method  has  a  certain  degree  of  theoretical 
elegance,  which  no  doubt  partially  accounts  for  its  popularity,  but  on  the  whole,  as 
we  shall  see,  there  are  other  procedures  which  in  most  circumstances  are  superior. 

The method determines the minimum value of a function $f$ over a closed interval
$[c_1, c_2]$. In applications, $f$ may in fact be defined over a broader domain, but for
this method a fixed interval of search must be specified. The only property that is
assumed of $f$ is that it is unimodal, that is, it has a single relative minimum (see
Fig. 8.1). The minimum point of $f$ is to be determined, at least approximately, by
measuring the value of $f$ at a certain number of points. It should be imagined, as is
indeed the case in the setting of nonlinear programming, that each measurement of
$f$ is somewhat costly, of time if nothing more.

To develop an appropriate search strategy, that is, a strategy for selecting measurement
points based on the previously obtained values, we pose the following
problem: Find how to successively select $N$ measurement points so that, without
explicit knowledge of $f$, we can determine the smallest possible region of uncertainty
in which the minimum must lie. In this problem the region of uncertainty is
determined in any particular case by the relative values of the measured points in
conjunction with our assumption that $f$ is unimodal. Thus, after values are known
at $N$ points $x_1, x_2, \ldots, x_N$ with

$$c_1 \le x_1 < x_2 < \cdots < x_{N-1} < x_N \le c_2,$$

the region of uncertainty is the interval $[x_{k-1}, x_{k+1}]$ where $x_k$ is the minimum point
among the $N$, and we define $x_0 = c_1$, $x_{N+1} = c_2$ for consistency. The minimum of $f$
must lie somewhere in this interval.

The  derivation  of  the  optimal  strategy  for  successively  selecting  measurement 
points to obtain the smallest region of uncertainty is fairly straightforward but
somewhat  tedious.  We  simply  state  the  result  and  give  an  example. 

Let 


$d_1 = c_2 - c_1$, the initial width of uncertainty
$d_k$ = width of uncertainty after $k$ measurements


Fig. 8.1 A unimodal function


Then, if a total of $N$ measurements are to be made, we have

$$d_k = \left( \frac{F_{N-k+1}}{F_N} \right) d_1, \tag{8.1}$$

where the integers $F_k$ are members of the Fibonacci sequence generated by the
recurrence relation

$$F_N = F_{N-1} + F_{N-2}, \quad F_0 = F_1 = 1. \tag{8.2}$$

The resulting sequence is 1, 1, 2, 3, 5, 8, 13, ... .

The procedure for reducing the width of uncertainty to $d_N$ is this: The first two
measurements are made symmetrically at a distance of $(F_{N-1}/F_N) d_1$ from the ends
of the initial interval; according to which of these is of lesser value, an uncertainty
interval of width $d_2 = (F_{N-1}/F_N) d_1$ is determined. The third measurement point is
placed symmetrically in this new interval of uncertainty with respect to the measurement
already in the interval. The result of this third measurement gives an interval
of uncertainty $d_3 = (F_{N-2}/F_N) d_1$. In general, each successive measurement point
is placed in the current interval of uncertainty symmetrically with the point already
existing in that interval.

Some examples are shown in Fig. 8.2. In these examples the sequence of measurement
points is determined in accordance with the assumption that each measurement
is of lower value than its predecessors. Note that the procedure always calls
for the last two measurements to be made at the midpoint of the semifinal interval of
uncertainty. We are to imagine that these two points are actually separated a small
distance so that a comparison of their respective values will reduce the interval to
nearly half. This terminal anomaly of the Fibonacci search process is, of course, of
no great practical consequence.
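
The procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: it requires $N \ge 3$, the function names are hypothetical, and the parameter `eps` is an illustrative device for separating the last two measurements by a small distance, as suggested in the text.

```python
def fibonacci_numbers(n):
    """Return [F_0, ..., F_n] with F_0 = F_1 = 1."""
    F = [1, 1]
    while len(F) <= n:
        F.append(F[-1] + F[-2])
    return F[: n + 1]


def fibonacci_search(f, c1, c2, N, eps=1e-8):
    """Shrink [c1, c2] using N evaluations of a unimodal f; return (lo, hi)."""
    if N < 3:
        raise ValueError("this sketch assumes N >= 3")
    F = fibonacci_numbers(N)
    lo, hi = c1, c2
    d = (hi - lo) * F[N - 1] / F[N]      # d_2, the width after two measurements
    x1, x2 = hi - d, lo + d              # first two points, placed symmetrically
    f1, f2 = f(x1), f(x2)
    for k in range(2, N + 1):            # k = number of measurements made so far
        if f1 < f2:                      # minimum lies in [lo, x2]
            hi = x2
            if k == N:
                break
            x2, f2 = x1, f1              # x1 becomes the right interior point
            x1 = lo + (hi - x2)          # new point, symmetric to the old one
            if k == N - 1:
                x1 -= eps                # separate the coinciding final points
            f1 = f(x1)
        else:                            # minimum lies in [x1, hi]
            lo = x1
            if k == N:
                break
            x1, f1 = x2, f2
            x2 = hi - (x1 - lo)
            if k == N - 1:
                x2 += eps
            f2 = f(x2)
    return lo, hi


print(fibonacci_search(lambda x: (x - 0.7) ** 2, 0.0, 2.0, 10))
```

With $N = 10$ on an interval of width 2, the final width should be close to $d_1 / F_{10} = 2/89$, up to the small offset `eps`.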

Search by Golden Section

If  the  number  N  of  allowed  measurement  points  in  a  Fibonacci  search  is  made  to 
approach  infinity,  we  obtain  the  golden  section  method.  It  can  be  argued,  based  on 
the  optimal  property  of  the  finite  Fibonacci  method,  that  the  corresponding  infinite 
version  yields  a  sequence  of  intervals  of  uncertainty  whose  widths  tend  to  zero  faster 
than  that  which  would  be  obtained  by  other  methods. 

Fig. 8.2 Fibonacci search


The solution to the Fibonacci difference equation

$$F_N = F_{N-1} + F_{N-2} \tag{8.3}$$

is of the form

$$F_N = A\tau_1^N + B\tau_2^N, \tag{8.4}$$

where $\tau_1$ and $\tau_2$ are roots of the characteristic equation

$$\tau^2 = \tau + 1.$$

Explicitly,

$$\tau_1 = \frac{1 + \sqrt{5}}{2}, \quad \tau_2 = \frac{1 - \sqrt{5}}{2}.$$

(The number $\tau_1 \simeq 1.618$ is known as the golden section ratio and was considered by
early Greeks to be the most aesthetic value for the ratio of two adjacent sides of a
rectangle.) For large $N$ the first term on the right side of (8.4) dominates the second,
and hence

$$\lim_{N \to \infty} \frac{F_{N-1}}{F_N} = \frac{1}{\tau_1} \simeq 0.618.$$

It follows from (8.1) that the interval of uncertainty at any point in the process has
width

$$d_k = \left( \frac{1}{\tau_1} \right)^{k-1} d_1, \tag{8.5}$$

and from this it follows that

$$\frac{d_{k+1}}{d_k} = \frac{1}{\tau_1} = 0.618. \tag{8.6}$$

Therefore, we conclude that, with respect to the width of the uncertainty interval, the
search by golden section converges linearly (see Sect. 7.8) to the overall minimum
of the function $f$ with convergence ratio $1/\tau_1 = 0.618$.
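
A golden section search can be sketched as follows. This is a minimal version under stated assumptions: the function name, the stopping tolerance `tol`, and the choice to return the midpoint of the final interval are illustrative and not prescribed by the text.

```python
import math

INV_TAU = (math.sqrt(5.0) - 1.0) / 2.0   # 1 / tau_1, approximately 0.618


def golden_section_search(f, c1, c2, tol=1e-6):
    """Approximate the minimizer of a unimodal f on [c1, c2]."""
    lo, hi = c1, c2
    x1 = hi - INV_TAU * (hi - lo)
    x2 = lo + INV_TAU * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while hi - lo > tol:
        if f1 < f2:
            # Minimum lies in [lo, x2]; the old x1 becomes the new x2.
            hi, x2, f2 = x2, x1, f1
            x1 = hi - INV_TAU * (hi - lo)
            f1 = f(x1)
        else:
            # Minimum lies in [x1, hi]; the old x2 becomes the new x1.
            lo, x1, f1 = x1, x2, f2
            x2 = lo + INV_TAU * (hi - lo)
            f2 = f(x2)
    return 0.5 * (lo + hi)


print(golden_section_search(lambda x: (x - 0.7) ** 2, 0.0, 2.0))
```

Each pass through the loop costs one new function evaluation and multiplies the width of the uncertainty interval by $1/\tau_1$, which is the linear convergence ratio stated above.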

The Fibonacci search method has a certain amount of theoretical appeal, since it
assumes only that the function being searched is unimodal and with respect to this
broad class of functions the method is, in some sense, optimal. In most problems,
however, it can be safely assumed that the function being searched, as well as being
unimodal, possesses a certain degree of smoothness, and one might, therefore, expect
that more efficient search techniques exploiting this smoothness can be devised;
and indeed they can. Techniques of this nature are usually based on curve fitting procedures
where a smooth curve is passed through the previously measured points in
order to determine an estimate of the minimum point. A variety of such techniques
can be devised depending on whether or not derivatives of the function as well as the
values can be measured, how many previous points are used to determine the fit, and
the criterion used to determine the fit. In this section a number of possibilities are
outlined and analyzed. All of them have orders of convergence greater than unity.


Quadratic  Fit 


The scheme that is often most useful in line searching is that of fitting a quadratic
through three given points. This has the advantage of not requiring any derivative
information. Given $x_1, x_2, x_3$ and corresponding values $f(x_1) = f_1$, $f(x_2) = f_2$,
$f(x_3) = f_3$ we construct the quadratic passing through these points

$$q(x) = \sum_{i=1}^{3} f_i \frac{\prod_{j \ne i} (x - x_j)}{\prod_{j \ne i} (x_i - x_j)} \tag{8.7}$$

and determine a new point $x_4$ as the point where the derivative of $q$ vanishes. Thus

$$x_4 = \frac{1}{2} \, \frac{b_{23} f_1 + b_{31} f_2 + b_{12} f_3}{a_{23} f_1 + a_{31} f_2 + a_{12} f_3}, \tag{8.8}$$

where $a_{ij} = x_i - x_j$, $b_{ij} = x_i^2 - x_j^2$.

Define the errors $\varepsilon_i = x^* - x_i$, $i = 1, 2, 3, 4$. The expression for $\varepsilon_4$ must be a
polynomial in $\varepsilon_1, \varepsilon_2, \varepsilon_3$. It must be second order (since it is a quadratic fit). It must
go to zero if any two of the errors $\varepsilon_1, \varepsilon_2, \varepsilon_3$ are zero. (The reader should check this.)
Finally, it must be symmetric (since the order of points is irrelevant). It follows that
near a minimum point $x^*$ of $f$, the errors are related approximately by

$$\varepsilon_4 = M(\varepsilon_1 \varepsilon_2 + \varepsilon_2 \varepsilon_3 + \varepsilon_1 \varepsilon_3), \tag{8.9}$$

where $M$ depends on the values of the second and third derivatives of $f$ at $x^*$.

If we assume that $\varepsilon_k \to 0$ with an order greater than unity, then for large $k$ the
error is governed approximately by

$$\varepsilon_{k+2} = M \varepsilon_k \varepsilon_{k-1}.$$

Letting $y_k = \log M\varepsilon_k$ this becomes

$$y_{k+2} = y_k + y_{k-1}$$

with characteristic equation

$$\lambda^3 - \lambda - 1 = 0.$$

The largest root of this equation is $\lambda \simeq 1.3$ which thus determines the rate of growth
of $y_k$ and is the order of convergence of the quadratic fit method.
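
Both the update (8.8) and the root of the characteristic equation are easy to evaluate numerically. The sketch below is illustrative only: the function names are hypothetical, and the root is located by simple bisection on $[1, 2]$, which is a convenience chosen here rather than a method from the text.

```python
def quadratic_fit_step(x1, x2, x3, f1, f2, f3):
    """One step of Eq. (8.8): stationary point of the quadratic through 3 points."""
    a23, a31, a12 = x2 - x3, x3 - x1, x1 - x2
    b23, b31, b12 = x2**2 - x3**2, x3**2 - x1**2, x1**2 - x2**2
    numerator = b23 * f1 + b31 * f2 + b12 * f3
    denominator = a23 * f1 + a31 * f2 + a12 * f3
    if denominator == 0.0:
        raise ZeroDivisionError("the three points lie on a straight line")
    return 0.5 * numerator / denominator


def order_of_quadratic_fit(tol=1e-12):
    """Real root of lambda**3 - lambda - 1 = 0, found by bisection on [1, 2]."""
    p = lambda lam: lam**3 - lam - 1.0
    lo, hi = 1.0, 2.0                    # p(1) = -1 < 0 < p(2) = 5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if p(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


f = lambda x: (x - 0.7) ** 2 + 1.0
print(quadratic_fit_step(0.0, 1.0, 2.0, f(0.0), f(1.0), f(2.0)))
print(order_of_quadratic_fit())
```

For an exactly quadratic $f$ a single step recovers the minimizer, and the computed root agrees with the value $\lambda \simeq 1.3$ quoted above.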


1st-Order Method: Curve Fitting and Methods of False Position

In this section a number of fitting methods using first derivative information are
described. All of them have orders of convergence greater than unity.


Quadratic  Fit:  Method  of  False  Position 

Suppose that at two points $x_k$ and $x_{k-1}$ the measurements $f(x_k)$, $f'(x_k)$, $f'(x_{k-1})$
are available. It is then possible to fit the quadratic

$$q(x) = f(x_k) + f'(x_k)(x - x_k) + \frac{f'(x_{k-1}) - f'(x_k)}{x_{k-1} - x_k} \cdot \frac{(x - x_k)^2}{2},$$

which has the same corresponding values. An estimate $x_{k+1}$ can then be determined
by finding the point where the derivative of $q$ vanishes; thus

$$x_{k+1} = x_k - f'(x_k) \left[ \frac{x_{k-1} - x_k}{f'(x_{k-1}) - f'(x_k)} \right]. \tag{8.10}$$


(See Fig. 8.3.) Comparing this formula with Newton’s method, we see again that
the value $f(x_k)$ does not enter; hence, our fit could have been passed through either
$f(x_k)$ or $f(x_{k-1})$. Also the formula can be regarded as an approximation to Newton’s
method where the second derivative is replaced by the difference of two first
derivatives.


Fig.  8.3  False  position  for  minimization 


Again, since this method does not depend on values of $f$ directly, it can be
regarded as a method for solving $f'(x) = g(x) = 0$. Viewed in this way the method,
which is illustrated in Fig. 8.4, takes the form

$$x_{k+1} = x_k - g(x_k) \left[ \frac{x_k - x_{k-1}}{g(x_k) - g(x_{k-1})} \right]. \tag{8.11}$$

We next investigate the order of convergence of the method of false position and
discover that it is order $\tau_1 \simeq 1.618$, the golden mean.

Proposition. Let $g$ have a continuous second derivative and suppose $x^*$ is such that $g(x^*) = 0$,
$g'(x^*) \ne 0$. Then for $x_0$ sufficiently close to $x^*$, the sequence $\{x_k\}_{k=0}^{\infty}$ generated by the
method of false position (8.11) converges to $x^*$ with order $\tau_1 \simeq 1.618$.


Proof. Introducing the notation

$$g[a, b] = \frac{g(b) - g(a)}{b - a}, \tag{8.12}$$

Fig. 8.4 False position for solving equations


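we have

$$x_{k+1} - x^* = x_k - x^* - g(x_k) \left[ \frac{x_k - x_{k-1}}{g(x_k) - g(x_{k-1})} \right] = (x_k - x^*) \left\{ \frac{g[x_{k-1}, x_k] - g[x_k, x^*]}{g[x_{k-1}, x_k]} \right\}. \tag{8.13}$$

Further, upon the introduction of the notation

$$g[a, b, c] = \frac{g[a, b] - g[b, c]}{a - c},$$

we may write (8.13) as

$$x_{k+1} - x^* = (x_k - x^*)(x_{k-1} - x^*) \left\{ \frac{g[x_{k-1}, x_k, x^*]}{g[x_{k-1}, x_k]} \right\}.$$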

Now, by the mean value theorem with remainder, we have (see Exercise 2)

$$g[x_{k-1}, x_k] = g'(\xi_k) \tag{8.14}$$

and

$$g[x_{k-1}, x_k, x^*] = \frac{1}{2} g''(\eta_k), \tag{8.15}$$

where $\xi_k$ and $\eta_k$ are convex combinations of $x_k, x_{k-1}$ and $x_k, x_{k-1}, x^*$, respectively.
Thus

$$x_{k+1} - x^* = \frac{g''(\eta_k)}{2 g'(\xi_k)} (x_k - x^*)(x_{k-1} - x^*). \tag{8.16}$$

It follows immediately that the process converges if it is started sufficiently close
to $x^*$.

To determine the order of convergence, we note that for large $k$ Eq. (8.16) becomes
approximately

$$x_{k+1} - x^* = M (x_k - x^*)(x_{k-1} - x^*),$$

where

$$M = \frac{g''(x^*)}{2 g'(x^*)}.$$

Thus defining $\varepsilon_k = (x_k - x^*)$ we have, in the limit,

$$\varepsilon_{k+1} = M \varepsilon_k \varepsilon_{k-1}. \tag{8.17}$$


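Taking the logarithm of this equation we have, with $y_k = \log M\varepsilon_k$,

$$y_{k+1} = y_k + y_{k-1}, \tag{8.18}$$

which is the Fibonacci difference equation discussed in Sect. 8.1. A solution to this
equation will satisfy

$$y_{k+1} - \tau_1 y_k \to 0.$$

Thus

$$\log M\varepsilon_{k+1} - \tau_1 \log M\varepsilon_k \to 0 \quad \text{or} \quad \log \frac{M\varepsilon_{k+1}}{(M\varepsilon_k)^{\tau_1}} \to 0,$$

and hence

$$\frac{\varepsilon_{k+1}}{\varepsilon_k^{\tau_1}} \to M^{(\tau_1 - 1)}. \qquad \blacksquare$$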


Having derived the error formula (8.17) by direct analysis, it is now appropriate
to point out a short-cut technique, based on symmetry and other considerations,
that can sometimes be used in even more complicated situations. The right side of
error formula (8.17) must be a polynomial in $\varepsilon_k$ and $\varepsilon_{k-1}$, since it is derived from
approximations based on Taylor’s theorem. Furthermore, it must be second order,
since the method reduces to Newton’s method when $x_k = x_{k-1}$. Also, it must go
to zero if either $\varepsilon_k$ or $\varepsilon_{k-1}$ goes to zero, since the method clearly yields $\varepsilon_{k+1} = 0$ in
that case. Finally, it must be symmetric in $\varepsilon_k$ and $\varepsilon_{k-1}$, since the order of points is
irrelevant. The only formula satisfying these requirements is $\varepsilon_{k+1} = M \varepsilon_k \varepsilon_{k-1}$.


Cubic  Fit 


Given the points $x_{k-1}$ and $x_k$ together with the values $f(x_{k-1})$, $f'(x_{k-1})$, $f(x_k)$, $f'(x_k)$,
it is also possible to fit a cubic equation to the points having corresponding values.
The next point $x_{k+1}$ can then be determined as the relative minimum point of this
cubic. This leads to

$$x_{k+1} = x_k - (x_k - x_{k-1}) \left[ \frac{f'(x_k) + u_2 - u_1}{f'(x_k) - f'(x_{k-1}) + 2u_2} \right], \tag{8.19}$$

where

$$u_1 = f'(x_{k-1}) + f'(x_k) - 3 \, \frac{f(x_{k-1}) - f(x_k)}{x_{k-1} - x_k},$$
$$u_2 = [u_1^2 - f'(x_{k-1}) f'(x_k)]^{1/2},$$

which is easily implementable for computations.

It  can  be  shown  (see  Exercise  3)  that  the  order  of  convergence  of  the  cubic  fit 
method  is  2.0.  Thus,  although  the  method  is  exact  for  cubic  functions  indicating 
that  its  order  might  be  three,  its  order  is  actually  only  two. 
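
A single cubic fit step (8.19) might be coded as follows. This is a minimal sketch: the function names are hypothetical, and the sign factor on $u_2$ is an addition not shown in (8.19), included on the assumption that the formula as printed presumes $x_{k-1} < x_k$.

```python
import math

def cubic_fit_step(f, df, x_prev, x_curr):
    """One cubic fit update (8.19) from the points x_{k-1} and x_k."""
    f_prev, f_curr = f(x_prev), f(x_curr)
    d_prev, d_curr = df(x_prev), df(x_curr)
    u1 = d_prev + d_curr - 3.0 * (f_prev - f_curr) / (x_prev - x_curr)
    radicand = u1 * u1 - d_prev * d_curr
    if radicand < 0.0:
        raise ValueError("the fitted cubic has no relative minimum for these data")
    # Assumption: u2 takes the sign of (x_k - x_{k-1}) so either ordering works.
    u2 = math.copysign(math.sqrt(radicand), x_curr - x_prev)
    return x_curr - (x_curr - x_prev) * (d_curr + u2 - u1) / (d_curr - d_prev + 2.0 * u2)

# The fit is exact for cubics: f(x) = x^3 - 3x has its relative minimum at x = 1.
f = lambda x: x ** 3 - 3.0 * x
df = lambda x: 3.0 * x ** 2 - 3.0
step_forward = cubic_fit_step(f, df, 0.0, 2.0)
step_backward = cubic_fit_step(f, df, 2.0, 0.0)
print(step_forward, step_backward)
```

Because the example function is itself a cubic, one step from either ordering of the two points should land on the relative minimum.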


2nd-Order Method: Newton’s Method

Suppose that the function $f$ of a single variable $x$ is to be minimized, and suppose
that at a point $x_k$ where a measurement is made it is possible to evaluate the three
numbers $f(x_k)$, $f'(x_k)$, $f''(x_k)$. It is then possible to construct a quadratic function
$q$ which at $x_k$ agrees with $f$ up to second derivatives, that is

$$q(x) = f(x_k) + f'(x_k)(x - x_k) + \tfrac{1}{2} f''(x_k)(x - x_k)^2. \tag{8.20}$$

We may then calculate an estimate $x_{k+1}$ of the minimum point of $f$ by finding the
point where the derivative of $q$ vanishes. Thus setting

$$0 = q'(x_{k+1}) = f'(x_k) + f''(x_k)(x_{k+1} - x_k),$$

we find

$$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}. \tag{8.21}$$

This process, which is illustrated in Fig. 8.5, can then be repeated at $x_{k+1}$.

Fig. 8.5 Newton’s method for minimization

We note immediately that the new point $x_{k+1}$ resulting from Newton’s method
does not depend on the value $f(x_k)$. The method can more simply be viewed as a
technique for iteratively solving equations of the form

$$g(x) = 0,$$

where, when applied to minimization, we put $g(x) \equiv f'(x)$. In this notation Newton’s
method takes the form

$$x_{k+1} = x_k - \frac{g(x_k)}{g'(x_k)}. \tag{8.22}$$

This form is illustrated in Fig. 8.6.
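
Iteration (8.22) is short enough to sketch directly. The example below is only illustrative: the objective $f(x) = x - \ln x$ (so that $g(x) = 1 - 1/x$ and the minimizer is $x^* = 1$), the starting point, and the function name are assumptions chosen so that the errors are easy to follow by hand.

```python
def newton_iterates(g, dg, x0, steps):
    """Iterates of Newton's method (8.22) for solving g(x) = 0."""
    xs = [x0]
    for _ in range(steps):
        slope = dg(xs[-1])
        if slope == 0.0:  # g'(x_k) = 0: the update is undefined, so stop
            break
        xs.append(xs[-1] - g(xs[-1]) / slope)
    return xs

# Illustrative problem (an assumption): minimize f(x) = x - ln(x) for x > 0,
# so g(x) = f'(x) = 1 - 1/x, g'(x) = 1/x^2, and the minimizer is x* = 1.
g = lambda x: 1.0 - 1.0 / x
dg = lambda x: 1.0 / (x * x)

xs = newton_iterates(g, dg, 0.5, steps=6)
errors = [abs(x - 1.0) for x in xs]
print(errors)
```

For this particular function the update reduces to $x_{k+1} = 2x_k - x_k^2$, so each error is the square of the previous one until rounding takes over.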

We now show that Newton’s method has order two convergence:


Proposition. Let the function $g$ have a continuous second derivative, and let $x^*$ satisfy
$g(x^*) = 0$, $g'(x^*) \neq 0$. Then, provided $x_0$ is sufficiently close to $x^*$, the sequence $\{x_k\}_{k=0}^{\infty}$
generated by Newton’s method (8.22) converges to $x^*$ with an order of convergence at least
two.


Proof. For points $\xi$ in a region near $x^*$ there is a $k_1$ such that $|g''(\xi)| < k_1$ and a $k_2$
such that $|g'(\xi)| > k_2$. Then since $g(x^*) = 0$ we can write

$$x_{k+1} - x^* = x_k - x^* - \frac{g(x_k) - g(x^*)}{g'(x_k)} = -[g(x_k) - g(x^*) + g'(x_k)(x^* - x_k)]/g'(x_k).$$

Fig. 8.6 Newton’s method for solving equations

The term in brackets is, by Taylor’s theorem, zero to first-order. In fact, using the
remainder term in a Taylor series expansion about $x_k$, we obtain

$$x_{k+1} - x^* = \frac{1}{2} \frac{g''(\xi)}{g'(x_k)} (x_k - x^*)^2$$

for some $\xi$ between $x^*$ and $x_k$. Thus in the region near $x^*$,

$$|x_{k+1} - x^*| \le \frac{k_1}{2k_2} |x_k - x^*|^2.$$

We see that if $|x_k - x^*| k_1/2k_2 < 1$, then $|x_{k+1} - x^*| < |x_k - x^*|$ and thus we conclude
that if started close enough to the solution, the method will converge to $x^*$ with an
order of convergence at least two. ∎


Global Convergence of Curve Fitting

Above, we analyzed the convergence of various curve fitting procedures in the
neighborhood of the solution point. If, however, any of these procedures were
applied in pure form to search a line for a minimum, there is the danger (alas,
the most likely possibility) that the process would diverge or wander about
meaninglessly. In other words, the process may never get close enough to the solution for
our detailed local convergence analysis to be applicable. It is therefore important to
artfully combine our knowledge of the local behavior with conditions guaranteeing
global convergence to yield a workable and effective procedure.

The key to guaranteeing global convergence is the Global Convergence Theorem
of Chap. 7. Application of this theorem in turn hinges on the construction of a suitable
descent function and minor modifications of a pure curve fitting algorithm. We
offer below a particular blend of this kind of construction and analysis, taking as
departure point the quadratic fit procedure discussed in Sect. 8.1 above.

Let us assume that the function $f$ that we wish to minimize is strictly unimodal
and has continuous second partial derivatives. We initiate our search procedure by
searching along the line until we find three points $x_1$, $x_2$, $x_3$ with $x_1 < x_2 < x_3$ such
that $f(x_1) \ge f(x_2) \le f(x_3)$. In other words, the value at the middle of these three
points is less than that at either end. Such a sequence of points can be determined in
a number of ways (see Exercise 7).

The main reason for using points having this pattern is that a quadratic fit to these
points will have a minimum (rather than a maximum) and the minimum point will
lie in the interval $[x_1, x_3]$. See Fig. 8.7. We modify the pure quadratic fit algorithm
so that it always works with points in this basic three-point pattern.

The point $x_4$ is calculated from the quadratic fit in the standard way and $f(x_4)$
is measured. Assuming (as in the figure) that $x_2 < x_4 < x_3$, and accounting for the
unimodal nature of $f$, there are but two possibilities:

1. $f(x_4) < f(x_2)$

2. $f(x_2) < f(x_4) < f(x_3)$.

In either case a new three-point pattern, $\bar{x}_1, \bar{x}_2, \bar{x}_3$, involving $x_4$ and two of the old
points, can be determined: In case 1 it is

$$(\bar{x}_1, \bar{x}_2, \bar{x}_3) = (x_2, x_4, x_3),$$

while in case 2 it is

$$(\bar{x}_1, \bar{x}_2, \bar{x}_3) = (x_1, x_2, x_4).$$

We then use this three-point pattern to fit another quadratic and continue. The pure
quadratic fit procedure determines the next point from the current point and the
previous two points. In the modification above, the next point is determined from
the current point and the two out of three last points that form a three-point pattern
with it. This simple modification leads to global convergence.

To prove convergence, we note that each three-point pattern can be thought of as
defining a vector $\mathbf{x}$ in $E^3$. Corresponding to an $\mathbf{x} = (x_1, x_2, x_3)$ such that $(x_1, x_2, x_3)$
form a three-point pattern with respect to $f$, we define $A(\mathbf{x}) = (\bar{x}_1, \bar{x}_2, \bar{x}_3)$ as discussed
above. For completeness we must consider the case where two or more
of the $x_i$, $i = 1, 2, 3$ are equal, since this may occur. The appropriate definitions
are simply limiting cases of the earlier ones. For example, if $x_1 = x_2$, then
$(x_1, x_2, x_3)$ form a three-point pattern if $f(x_2) \le f(x_3)$ and $f'(x_2) \le 0$ (which
is the limiting case of $f(x_2) \le f(x_1)$). A quadratic is fit in this case by using the
values at the two distinct points and the derivative at the duplicated point. In case
$x_1 = x_2 = x_3$, $(x_1, x_2, x_3)$ forms a three-point pattern if $f'(x_2) = 0$ and $f''(x_2) \ge 0$.


Fig. 8.7 Three-point pattern


With these definitions, the map $A$ is well defined. It is also continuous, since curve
fitting depends continuously on the data.

We next define the solution set $\Gamma \subset E^3$ as the points $\mathbf{x}^* = (x^*, x^*, x^*)$ where
$f'(x^*) = 0$.

Finally, we let $Z(\mathbf{x}) = f(x_1) + f(x_2) + f(x_3)$. It is easy to see that $Z$ is a descent
function for $A$. After application of $A$ one of the values $f(x_1)$, $f(x_2)$, $f(x_3)$ will be
replaced by $f(x_4)$, and by construction, and the assumption that $f$ is unimodal, it will
replace a strictly larger value. Of course, at $\mathbf{x}^* = (x^*, x^*, x^*)$ we have $A(\mathbf{x}^*) = \mathbf{x}^*$
and hence $Z(A(\mathbf{x}^*)) = Z(\mathbf{x}^*)$.

Since all points are contained in the initial interval, we have all the requirements
for the Global Convergence Theorem. Thus the process converges to the solution.
The order of convergence may not be destroyed by this modification, if near the
solution the three-point pattern is always formed from the previous three points. In
this case we would still have convergence of order 1.3. This cannot be guaranteed,
however.
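
The three-point-pattern modification of quadratic fit described above can be sketched as follows. This is only one possible reading: the function names, the stopping tolerance, the treatment of a point $x_4$ that falls to the left of $x_2$ (handled as the mirror image of the two cases in the text), and the test function are all assumptions made for illustration.

```python
import math

def quadratic_fit_minimum(x1, x2, x3, f1, f2, f3):
    """Minimizer of the parabola through three points; None if the fit is degenerate."""
    num = (x2 - x1) ** 2 * (f2 - f3) - (x2 - x3) ** 2 * (f2 - f1)
    den = (x2 - x1) * (f2 - f3) - (x2 - x3) * (f2 - f1)
    if den == 0.0:
        return None
    return x2 - 0.5 * num / den

def three_point_quadratic_search(f, x1, x2, x3, tol=1e-7, max_iter=200):
    """Quadratic fit that always works with a three-point pattern."""
    f1, f2, f3 = f(x1), f(x2), f(x3)
    if not (x1 < x2 < x3 and f1 >= f2 <= f3):
        raise ValueError("initial points do not form a three-point pattern")
    for _ in range(max_iter):
        x4 = quadratic_fit_minimum(x1, x2, x3, f1, f2, f3)
        # Stop on a degenerate fit, a point outside the bracket, or a negligible move.
        if x4 is None or not (x1 < x4 < x3) or abs(x4 - x2) < tol:
            break
        f4 = f(x4)
        if x4 > x2:
            if f4 < f2:   # case 1: new pattern (x2, x4, x3)
                x1, f1, x2, f2 = x2, f2, x4, f4
            else:         # case 2: new pattern (x1, x2, x4)
                x3, f3 = x4, f4
        else:             # assumed mirror cases when x4 lies left of x2
            if f4 < f2:
                x3, f3, x2, f2 = x2, f2, x4, f4
            else:
                x1, f1 = x4, f4
    return x2

# Illustrative unimodal function (an assumption): f(x) = exp(x) - 2x, minimized at ln 2.
result = three_point_quadratic_search(lambda x: math.exp(x) - 2.0 * x, 0.0, 0.5, 2.0)
print(result, math.log(2.0))
```

In this run one end point of the pattern tends to stay fixed for many iterations, which is consistent with the remark that the order of convergence of the pure method cannot be guaranteed.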

It has often been implicitly suggested, and accepted, that when using the quadratic
fit technique one should require

$$f(x_{k+1}) < f(x_k)$$

so as to guarantee convergence. If the inequality is not satisfied at some cycle, then a
special local search is used to find a better $x_{k+1}$ that does satisfy it. This philosophy
amounts to taking $Z(\mathbf{x}) = f(x_3)$ in our general framework and, unfortunately, this
is not a descent function even for unimodal functions, and hence the special local
search is likely to be necessary several times. It is true, of course, that a similar
special local search may, occasionally, be required for the technique we suggest in
regions of multiple minima, but it is never required in a unimodal region.

The above construction, based on the pure quadratic fit technique, can be emulated
to produce effective procedures based on other curve fitting techniques. For
application to smooth functions these techniques seem to be the best available in
terms of flexibility to accommodate as much derivative information as is available,
fast convergence, and a guarantee of global convergence.


*Closedness  of  Line  Search  Algorithms 

Since searching along a line for a minimum point is a component part of most nonlinear
programming algorithms, it is desirable to establish at once that this procedure
is closed; that is, that the end product of the iterative procedures outlined
above, when viewed as a single algorithmic step finding a minimum along a line,
define closed algorithms. That is the objective of this section.

To initiate a line search with respect to a function $f$, two vectors must be specified:
the initial point $\mathbf{x}$ and the direction $\mathbf{d}$ in which the search is to be made. The
result of the search is a new point. Thus we define the search algorithm $S$ as a
mapping from $E^{2n}$ to $E^n$.

We assume that the search is to be made over the semi-infinite line emanating
from $\mathbf{x}$ in the direction $\mathbf{d}$. We also assume, for simplicity, that the search is not made
in vain; that is, we assume that there is a minimum point along the line. This will
be the case, for instance, if $f$ is continuous and increases without bound as $\mathbf{x}$ tends
toward infinity.

Definition. The mapping $S : E^{2n} \to E^n$ is defined by

$$S(\mathbf{x}, \mathbf{d}) = \{\mathbf{y} : \mathbf{y} = \mathbf{x} + \alpha\mathbf{d} \text{ for some } \alpha \ge 0,\ f(\mathbf{y}) = \min_{0 \le \alpha < \infty} f(\mathbf{x} + \alpha\mathbf{d})\}. \tag{8.23}$$

In some cases there may be many vectors $\mathbf{y}$ yielding the minimum, so $S$ is a set-valued
mapping. We must verify that $S$ is closed.

Theorem. Let $f$ be continuous on $E^n$. Then the mapping defined by (8.23) is closed at $(\mathbf{x}, \mathbf{d})$
if $\mathbf{d} \neq \mathbf{0}$.

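Proof. Suppose $\{\mathbf{x}_k\}$ and $\{\mathbf{d}_k\}$ are sequences with $\mathbf{x}_k \to \mathbf{x}$, $\mathbf{d}_k \to \mathbf{d} \neq \mathbf{0}$. Suppose
also that $\mathbf{y}_k \in S(\mathbf{x}_k, \mathbf{d}_k)$ and that $\mathbf{y}_k \to \mathbf{y}$. We must show that $\mathbf{y} \in S(\mathbf{x}, \mathbf{d})$.

For each $k$ we have $\mathbf{y}_k = \mathbf{x}_k + \alpha_k \mathbf{d}_k$ for some $\alpha_k$. From this we may write

$$\alpha_k = \frac{|\mathbf{y}_k - \mathbf{x}_k|}{|\mathbf{d}_k|}.$$

Taking the limit of the right-hand side of the above, we see that

$$\alpha_k \to \bar{\alpha} \equiv \frac{|\mathbf{y} - \mathbf{x}|}{|\mathbf{d}|}.$$

It then follows that $\mathbf{y} = \mathbf{x} + \bar{\alpha}\mathbf{d}$. It still remains to be shown that $\mathbf{y} \in S(\mathbf{x}, \mathbf{d})$.
For each $k$ and each $\alpha$, $0 \le \alpha < \infty$,

$$f(\mathbf{y}_k) \le f(\mathbf{x}_k + \alpha\mathbf{d}_k).$$

Letting $k \to \infty$ we obtain

$$f(\mathbf{y}) \le f(\mathbf{x} + \alpha\mathbf{d}).$$

Thus

$$f(\mathbf{y}) \le \min_{0 \le \alpha < \infty} f(\mathbf{x} + \alpha\mathbf{d}),$$

and hence $\mathbf{y} \in S(\mathbf{x}, \mathbf{d})$. ∎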


The requirement that $\mathbf{d} \neq \mathbf{0}$ is natural both theoretically and practically. From
a practical point of view this condition implies that, when constructing algorithms,
the choice $\mathbf{d} = \mathbf{0}$ had better occur only in the solution set; but it is clear that if $\mathbf{d} = \mathbf{0}$,
no search will be made. Theoretically, the map $S$ can fail to be closed at $\mathbf{d} = \mathbf{0}$, as
illustrated below.


Example. On $E^1$ define $f(x) = (x - 1)^2$. Then $S(x, d)$ is not closed at $x = 0$, $d = 0$.
To see this we note that for any $d > 0$

$$\min_{0 \le \alpha < \infty} f(\alpha d) = f(1),$$

and hence

$$S(0, d) = 1;$$

but

$$\min_{0 \le \alpha < \infty} f(\alpha \cdot 0) = f(0),$$

so that

$$S(0, 0) = 0.$$

Thus as $d \to 0$, $S(0, d) \not\to S(0, 0)$.


Inaccurate  Line  Search 

In practice, of course, it is impossible to obtain the exact minimum point called
for by the ideal line search algorithm $S$ described above. As a matter of fact, it is
often desirable to sacrifice accuracy in the line search routine in order to conserve
overall computation time. Because of these factors we must, to be realistic, be certain,
at every stage of development, that our theory does not crumble if inaccurate
line searches are introduced.

Inaccuracy generally is introduced in a line search algorithm by simply terminating
the search procedure before it has converged. The exact nature of the inaccuracy
introduced may therefore depend on the particular search technique employed
and the criterion used for terminating the search. We cannot develop a theory that
simultaneously covers every important version of inaccuracy without seriously detracting
from the underlying simplicity of the algorithms discussed later. For this
reason our general approach, which is admittedly more free-wheeling in spirit than
necessary but thereby more transparent and less encumbered than a detailed account
of inaccuracy, will be to analyze algorithms as if an accurate line search were
made at every step, and then point out in side remarks and exercises the effect of
inaccuracy.


Armijo’s  Rule 

A practical and popular criterion for terminating a line search is Armijo’s rule. The
essential idea is that the rule should first guarantee that the selected $\alpha$ is not too
large, and next it should not be too small. Let us define the function

$$\phi(\alpha) = f(\mathbf{x}_k + \alpha\mathbf{d}_k).$$

Armijo’s rule is implemented by consideration of the function $\phi(0) + \varepsilon\phi'(0)\alpha$ for
fixed $\varepsilon$, $0 < \varepsilon < 1$. This function is shown in Fig. 8.8a as the dashed line. A value
of $\alpha$ is considered to be not too large if the corresponding function value lies below
the dashed line; that is, if

$$\phi(\alpha) \le \phi(0) + \varepsilon\phi'(0)\alpha. \tag{8.24}$$

To insure that $\alpha$ is not too small, a value $\eta > 1$ is selected, and $\alpha$ is then considered
to be not too small if

$$\phi(\eta\alpha) > \phi(0) + \varepsilon\phi'(0)\eta\alpha.$$

This means that if $\alpha$ is increased by the factor $\eta$, it will fail to meet the test (8.24).
The acceptable region defined by the Armijo rule is shown in Fig. 8.8a when $\eta = 2$
(there are also other rules that can be adapted).

Sometimes in practice, the Armijo test is used to define a simplified line search
technique that does not employ curve fitting methods. One begins with an arbitrary $\alpha$.
If it satisfies (8.24), it is repeatedly increased by $\eta$ ($\eta = 2$ or $\eta = 10$ and $\varepsilon = .2$ are
often used) until (8.24) is not satisfied, and then the penultimate $\alpha$ is selected. If, on
the other hand, the original $\alpha$ does not satisfy (8.24), it is repeatedly divided by $\eta$
until the resulting $\alpha$ does satisfy (8.24).
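
The simplified search just described might be sketched as below. The function name, the default values, and the iteration cap are assumptions added for illustration; the caller is assumed to supply $\phi$ and the slope $\phi'(0)$ (negative for a descent direction).

```python
def armijo_step(phi, dphi0, alpha=1.0, eps=0.2, eta=2.0, max_iter=60):
    """Simplified Armijo line search: expand while (8.24) holds, else shrink.

    phi   : callable, phi(alpha) = f(x_k + alpha * d_k)
    dphi0 : the slope phi'(0)
    max_iter is a safeguard added here; it is not part of the rule itself.
    """
    phi0 = phi(0.0)

    def not_too_large(a):
        # Test (8.24): phi(a) <= phi(0) + eps * phi'(0) * a
        return phi(a) <= phi0 + eps * dphi0 * a

    if not_too_large(alpha):
        # Increase by the factor eta until (8.24) fails; keep the penultimate value.
        for _ in range(max_iter):
            if not not_too_large(eta * alpha):
                break
            alpha *= eta
        return alpha
    # Otherwise divide by eta until (8.24) is satisfied.
    for _ in range(max_iter):
        alpha /= eta
        if not_too_large(alpha):
            return alpha
    raise RuntimeError("no acceptable step found; is dphi0 negative?")

# Illustrative one-dimensional example (an assumption): phi(a) = (a - 1)^2, phi'(0) = -2.
phi = lambda a: (a - 1.0) ** 2
step_from_small = armijo_step(phi, -2.0, alpha=0.3)
step_from_large = armijo_step(phi, -2.0, alpha=10.0)
print(step_from_small, step_from_large)
```

In this example test (8.24) reduces to $\alpha \le 1.6$, so the search started at 0.3 doubles to 1.2 and the search started at 10 halves to 1.25; in both cases multiplying the result by $\eta$ violates (8.24), which is the "not too small" condition.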
8.2 The Method of Steepest Descent

One of the oldest and most widely known methods for minimizing a function
of several variables is the method of steepest descent (often referred to as the
gradient method). The method is extremely important from a theoretical viewpoint,
since it is one of the simplest for which a satisfactory analysis exists. More
advanced algorithms are often motivated by an attempt to modify the basic steepest
descent technique in such a way that the new algorithm will have superior
convergence properties. The method of steepest descent remains, therefore, not only
the technique most often first tried on a new problem but also the standard of reference
against which other techniques are measured. The principles used for its
analysis will be used throughout this book.


The  Method 

Let $f$ have continuous first partial derivatives on $E^n$. We will frequently have need
for the gradient vector of $f$ and therefore we introduce some simplifying notation.
The gradient $\nabla f(\mathbf{x})$ is, according to our conventions, defined as an $n$-dimensional row
vector. For convenience we define the $n$-dimensional column vector $\mathbf{g}(\mathbf{x}) = \nabla f(\mathbf{x})^T$.
When there is no chance for ambiguity, we sometimes suppress the argument $\mathbf{x}$ and,
for example, write $\mathbf{g}_k$ for $\mathbf{g}(\mathbf{x}_k) = \nabla f(\mathbf{x}_k)^T$.

The method of steepest descent is defined by the iterative algorithm

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \mathbf{g}_k,$$

where stepsize $\alpha_k$ is a nonnegative scalar possibly minimizing $f(\mathbf{x}_k - \alpha\mathbf{g}_k)$. In words,
from the point $\mathbf{x}_k$ we search along the direction of the negative gradient $-\mathbf{g}_k$ to a
minimum point on this line; this minimum point is taken to be $\mathbf{x}_{k+1}$.

In formal terms, the overall algorithm $A : E^n \to E^n$ which gives $\mathbf{x}_{k+1} \in A(\mathbf{x}_k)$
can be decomposed in the form $A = SG$. Here $G : E^n \to E^{2n}$ is defined by $G(\mathbf{x}) =
(\mathbf{x}, -\mathbf{g}(\mathbf{x}))$, giving the initial point and direction of a line search. This is followed by
the line search $S : E^{2n} \to E^n$ defined in Sect. 8.1.


Global  Convergence  and  Convergence  Speed 

It was shown in Sect. 8.1 that $S$ is closed if $\nabla f(\mathbf{x}) \neq \mathbf{0}$, and it is clear that $G$ is
continuous. Therefore, by Corollary 2 in Sect. 7.7 $A$ is closed.

We define the solution set to be the points $\mathbf{x}$ where $\nabla f(\mathbf{x}) = \mathbf{0}$. Then $Z(\mathbf{x}) = f(\mathbf{x})$
is a descent function for $A$, since for $\nabla f(\mathbf{x}) \neq \mathbf{0}$

$$\min_{0 \le \alpha < \infty} f(\mathbf{x} - \alpha\mathbf{g}(\mathbf{x})) < f(\mathbf{x}).$$

Fig. 8.8 Stopping rules, (a) Armijo rule, (b) Golden test, (c) Wolfe test

Thus by the Global Convergence Theorem, if the sequence $\{\mathbf{x}_k\}$ is bounded, it will
have limit points and each of these is a solution. What about the convergence speed?
Assume that $f(\mathbf{x})$ is convex and differentiable everywhere, admits a minimizer $\mathbf{x}^*$,
and satisfies the (first-order) $\beta$-Lipschitz condition, that is, for any two points $\mathbf{x}$ and $\mathbf{y}$

$$|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})| \le \beta |\mathbf{x} - \mathbf{y}|$$

for a positive real number $\beta$. Starting from any point $\mathbf{x}_0$, we consider the method of
steepest descent with a fixed step size $1/\beta$ for all $k$:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \frac{1}{\beta}\mathbf{g}_k = \mathbf{x}_k - \frac{1}{\beta}\nabla f(\mathbf{x}_k)^T. \tag{8.25}$$


We first prove a lemma.

Lemma 1. Let $f(\mathbf{x})$ be differentiable everywhere and satisfy the (first-order) $\beta$-Lipschitz
condition. Then, for any two points $\mathbf{x}$ and $\mathbf{y}$

$$f(\mathbf{x}) - f(\mathbf{y}) - \nabla f(\mathbf{y})(\mathbf{x} - \mathbf{y}) \le \frac{\beta}{2}|\mathbf{x} - \mathbf{y}|^2.$$

Then we prove

Theorem 1 (Steepest Descent, Lipschitz Convex Case). Let $f(\mathbf{x})$ be convex and differentiable
everywhere, satisfy the (first-order) $\beta$-Lipschitz condition, and admit a minimizer $\mathbf{x}^*$.
Then, the method of steepest descent (8.25) generates a sequence of solutions $\mathbf{x}_k$ such that

$$|\nabla f(\mathbf{x}_k)| \le \frac{\beta}{\sqrt{k(k+1)}} |\mathbf{x}_0 - \mathbf{x}^*|$$

and

$$f(\mathbf{x}_k) - f(\mathbf{x}^*) \le \frac{\beta}{2(k+1)} |\mathbf{x}_0 - \mathbf{x}^*|^2.$$


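Proof. Consider the function $g_{\mathbf{x}}(\mathbf{y}) = f(\mathbf{y}) - \nabla f(\mathbf{x})\mathbf{y}$ for any given $\mathbf{x}$. Note that $g_{\mathbf{x}}$ is
also convex and satisfies the $\beta$-Lipschitz condition. Moreover, $\mathbf{x}$ is the minimizer of
$g_{\mathbf{x}}(\mathbf{y})$ and $\nabla g_{\mathbf{x}}(\mathbf{y}) = \nabla f(\mathbf{y}) - \nabla f(\mathbf{x})$.

Applying Lemma 1 to $g_{\mathbf{x}}$ and noting the relations of $g_{\mathbf{x}}$ and $f(\mathbf{x})$, we have

$$\begin{aligned}
f(\mathbf{x}) - f(\mathbf{y}) - \nabla f(\mathbf{x})(\mathbf{x} - \mathbf{y}) &= g_{\mathbf{x}}(\mathbf{x}) - g_{\mathbf{x}}(\mathbf{y}) \\
&\le g_{\mathbf{x}}\!\left(\mathbf{y} - \tfrac{1}{\beta}\nabla g_{\mathbf{x}}(\mathbf{y})^T\right) - g_{\mathbf{x}}(\mathbf{y}) \\
&\le \nabla g_{\mathbf{x}}(\mathbf{y})\left(-\tfrac{1}{\beta}\nabla g_{\mathbf{x}}(\mathbf{y})^T\right) + \tfrac{\beta}{2}\tfrac{1}{\beta^2}|\nabla g_{\mathbf{x}}(\mathbf{y})|^2 \\
&= -\tfrac{1}{2\beta}|\nabla g_{\mathbf{x}}(\mathbf{y})|^2 \\
&= -\tfrac{1}{2\beta}|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})|^2.
\end{aligned} \tag{8.26}$$

Similarly, we have

$$f(\mathbf{y}) - f(\mathbf{x}) - \nabla f(\mathbf{y})(\mathbf{y} - \mathbf{x}) \le -\tfrac{1}{2\beta}|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})|^2.$$

Adding the above two derived inequalities, we have for any $\mathbf{x}$ and $\mathbf{y}$:

$$(\nabla f(\mathbf{x}) - \nabla f(\mathbf{y}))(\mathbf{x} - \mathbf{y}) \ge \tfrac{1}{\beta}|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})|^2. \tag{8.27}$$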


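For simplification, in what follows let $\mathbf{d}_k = \mathbf{x}_k - \mathbf{x}^*$ and $\delta_k = [f(\mathbf{x}_k) - f(\mathbf{x}^*)] \ge 0$.
Now let $\mathbf{x} = \mathbf{x}_{k+1}$ and $\mathbf{y} = \mathbf{x}_k$ in (8.27). Then

$$-\tfrac{1}{\beta}(\mathbf{g}_k)^T(\mathbf{g}_{k+1} - \mathbf{g}_k) = (\mathbf{x}_{k+1} - \mathbf{x}_k)^T(\mathbf{g}_{k+1} - \mathbf{g}_k) \ge \tfrac{1}{\beta}|\mathbf{g}_{k+1} - \mathbf{g}_k|^2,$$

which leads to

$$|\mathbf{g}_{k+1}|^2 \le (\mathbf{g}_{k+1})^T\mathbf{g}_k \le |\mathbf{g}_{k+1}||\mathbf{g}_k|, \quad \text{that is} \quad |\mathbf{g}_{k+1}| \le |\mathbf{g}_k|. \tag{8.28}$$

Inequality (8.28) implies that $|\mathbf{g}_k| = |\nabla f(\mathbf{x}_k)|$ is monotonically decreasing.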

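Applying inequality (8.26) for $\mathbf{x} = \mathbf{x}_k$ and $\mathbf{y} = \mathbf{x}^*$ and noting $\mathbf{g}^* = \mathbf{0}$ we have

$$\begin{aligned}
\delta_k &\le (\mathbf{g}_k)^T\mathbf{d}_k - \tfrac{1}{2\beta}|\mathbf{g}_k|^2 \\
&= -\beta(\mathbf{x}_{k+1} - \mathbf{x}_k)^T\mathbf{d}_k - \tfrac{\beta}{2}|\mathbf{x}_{k+1} - \mathbf{x}_k|^2 \\
&= -\tfrac{\beta}{2}\left(|\mathbf{x}_{k+1} - \mathbf{x}_k|^2 + 2(\mathbf{x}_{k+1} - \mathbf{x}_k)^T\mathbf{d}_k\right) \\
&= -\tfrac{\beta}{2}\left(|\mathbf{d}_{k+1} - \mathbf{d}_k|^2 + 2(\mathbf{d}_{k+1} - \mathbf{d}_k)^T\mathbf{d}_k\right) \\
&= \tfrac{\beta}{2}\left(|\mathbf{d}_k|^2 - |\mathbf{d}_{k+1}|^2\right).
\end{aligned} \tag{8.29}$$

Summing up (8.29) from 0 to $k$, we have

$$\sum_{l=0}^{k} \delta_l \le \tfrac{\beta}{2}\left(|\mathbf{d}_0|^2 - |\mathbf{d}_{k+1}|^2\right) \le \tfrac{\beta}{2}|\mathbf{d}_0|^2. \tag{8.30}$$

Using (8.26) again for $\mathbf{x} = \mathbf{x}_{k+1}$ and $\mathbf{y} = \mathbf{x}_k$ and noting (8.25) we have

$$\begin{aligned}
\delta_{k+1} - \delta_k &= f(\mathbf{x}_{k+1}) - f(\mathbf{x}_k) \\
&\le (\mathbf{g}_{k+1})^T\left(-\tfrac{1}{\beta}\mathbf{g}_k\right) - \tfrac{1}{2\beta}|\mathbf{g}_{k+1} - \mathbf{g}_k|^2 \\
&= -\tfrac{1}{2\beta}\left(|\mathbf{g}_{k+1}|^2 + |\mathbf{g}_k|^2\right).
\end{aligned} \tag{8.31}$$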

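Noting (8.31) holds for all $k$, we have

$$\begin{aligned}
\sum_{l=0}^{k} \delta_l &= \sum_{l=0}^{k} (l + 1 - l)\,\delta_l \\
&= \sum_{l=0}^{k} (l + 1)\,\delta_l - \sum_{l=0}^{k} l\,\delta_l \\
&= \sum_{l=1}^{k+1} l\,\delta_{l-1} - \sum_{l=1}^{k} l\,\delta_l \\
&= \delta_k (k + 1) + \sum_{l=1}^{k} l\,(\delta_{l-1} - \delta_l) \\
&\ge \delta_k (k + 1) + \sum_{l=1}^{k} \tfrac{l}{2\beta}\left(|\mathbf{g}_l|^2 + |\mathbf{g}_{l-1}|^2\right) \\
&\ge \delta_k (k + 1) + \tfrac{k(k+1)}{2\beta}|\mathbf{g}_k|^2,
\end{aligned}$$

where the last inequality comes from the fact that $|\mathbf{g}_k| = |\nabla f(\mathbf{x}_k)|$ is monotonically decreasing.
Using (8.30) we finally have

$$(k + 1)\,\delta_k + \frac{k(k+1)}{2\beta}|\mathbf{g}_k|^2 \le \frac{\beta}{2}|\mathbf{d}_0|^2. \tag{8.32}$$

Inequality (8.32), from $\delta_k = f(\mathbf{x}_k) - f(\mathbf{x}^*) \ge 0$ and $\mathbf{d}_0 = \mathbf{x}_0 - \mathbf{x}^*$, proves the desired
bounds. ∎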

Theorem  1  implies  that  the  convergence  speed  of  the  steepest  descent  method  is 
arithmetic. 
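
The fixed-step iteration (8.25) and the two bounds of Theorem 1 can be checked on a small example. The sketch below is illustrative only: the convex function $f(\mathbf{x}) = x_1^2 + 0.05x_2^2$ (whose gradient is Lipschitz with $\beta = 2$), the starting point, and the helper names are assumptions, not part of the text.

```python
import math

def steepest_descent_fixed_step(grad, x0, beta, steps):
    """Iterates of x_{k+1} = x_k - (1/beta) * grad(x_k), as in (8.25)."""
    xs = [list(x0)]
    for _ in range(steps):
        g = grad(xs[-1])
        xs.append([xi - gi / beta for xi, gi in zip(xs[-1], g)])
    return xs

def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(vi * vi for vi in v))

# Illustrative convex problem (an assumption): f(x) = x1^2 + 0.05 * x2^2,
# with gradient (2 x1, 0.1 x2), Lipschitz constant beta = 2, minimizer x* = (0, 0).
f = lambda x: x[0] ** 2 + 0.05 * x[1] ** 2
grad = lambda x: [2.0 * x[0], 0.1 * x[1]]
beta, x_star, f_star = 2.0, [0.0, 0.0], 0.0

xs = steepest_descent_fixed_step(grad, [1.0, 1.0], beta, steps=50)
d0 = norm([a - b for a, b in zip(xs[0], x_star)])

# Check both bounds of Theorem 1 for k = 1, ..., 50.
bounds_hold = all(
    f(xs[k]) - f_star <= beta / (2.0 * (k + 1)) * d0 ** 2
    and norm(grad(xs[k])) <= beta / math.sqrt(k * (k + 1)) * d0
    for k in range(1, len(xs))
)
print(bounds_hold, f(xs[-1]))
```

On this example the bounds hold with a wide margin, since the function is in fact strongly convex and the iterates contract geometrically; the theorem only promises the slower arithmetic rate.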




The  Quadratic  Case 


When $f(\mathbf{x})$ is strongly convex, the convergence speed can be increased from arithmetic
to geometric or linear convergence. Since all of the important convergence
characteristics of the method of steepest descent are revealed by an investigation of
the method when applied to quadratic problems, we focus here on

$$f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} - \mathbf{x}^T\mathbf{b}, \tag{8.33}$$

where $\mathbf{Q}$ is a positive definite symmetric $n \times n$ matrix. Since $\mathbf{Q}$ is positive definite,
all of its eigenvalues are positive. We assume that these eigenvalues are ordered:
$0 < a = \lambda_1 \le \lambda_2 \le \ldots \le \lambda_n = A$. With $\mathbf{Q}$ positive definite, it follows (from Proposition 5,
Sect. 7.4) that $f$ is strictly convex.

The unique minimum point of $f$ can be found directly, by setting the gradient to
zero, as the vector $\mathbf{x}^*$ satisfying

$$\mathbf{Q}\mathbf{x}^* = \mathbf{b}. \tag{8.34}$$

Moreover, introducing the function

$$E(\mathbf{x}) = \tfrac{1}{2}(\mathbf{x} - \mathbf{x}^*)^T\mathbf{Q}(\mathbf{x} - \mathbf{x}^*), \tag{8.35}$$

we have $E(\mathbf{x}) = f(\mathbf{x}) + (1/2)\mathbf{x}^{*T}\mathbf{Q}\mathbf{x}^*$, which shows that the function $E$ differs from
$f$ only by a constant. For many purposes then, it will be convenient to consider that
we are minimizing $E$ rather than $f$.

The gradient (of both $f$ and $E$) is given explicitly by

$$\mathbf{g}(\mathbf{x}) = \mathbf{Q}\mathbf{x} - \mathbf{b}. \tag{8.36}$$


Thus the method of steepest descent can be expressed as

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k\mathbf{g}_k, \tag{8.37}$$

where $\mathbf{g}_k = \mathbf{Q}\mathbf{x}_k - \mathbf{b}$ and where $\alpha_k$ minimizes $f(\mathbf{x}_k - \alpha\mathbf{g}_k)$. We can, however, in this
special case, determine the value of $\alpha_k$ explicitly. We have, by definition (8.33),

$$f(\mathbf{x}_k - \alpha\mathbf{g}_k) = \tfrac{1}{2}(\mathbf{x}_k - \alpha\mathbf{g}_k)^T\mathbf{Q}(\mathbf{x}_k - \alpha\mathbf{g}_k) - (\mathbf{x}_k - \alpha\mathbf{g}_k)^T\mathbf{b},$$

which (as can be found by differentiating with respect to $\alpha$) is minimized at

$$\alpha_k = \frac{\mathbf{g}_k^T\mathbf{g}_k}{\mathbf{g}_k^T\mathbf{Q}\mathbf{g}_k}. \tag{8.38}$$

Hence the method of steepest descent (8.37) takes the explicit form

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \left(\frac{\mathbf{g}_k^T\mathbf{g}_k}{\mathbf{g}_k^T\mathbf{Q}\mathbf{g}_k}\right)\mathbf{g}_k, \tag{8.39}$$

where $\mathbf{g}_k = \mathbf{Q}\mathbf{x}_k - \mathbf{b}$.

The function $f$ and the steepest descent process can be illustrated as in Fig. 8.9 by
showing contours of constant values of $f$ and a typical sequence developed by the
process. The contours of $f$ are $n$-dimensional ellipsoids with axes in the directions
of the $n$-mutually orthogonal eigenvectors of $\mathbf{Q}$. The axis corresponding to the $i$th
eigenvector has length proportional to $1/\lambda_i$. We now analyze this process and show
that the rate of convergence depends on the ratio of the lengths of the axes of the
elliptical contours of $f$, that is, on the eccentricity of the ellipsoids.


Fig.  8.9  Steepest  descent 


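Lemma 2. The iterative process (8.39) satisfies

$$E(\mathbf{x}_{k+1}) = \left\{1 - \frac{(\mathbf{g}_k^T\mathbf{g}_k)^2}{(\mathbf{g}_k^T\mathbf{Q}\mathbf{g}_k)(\mathbf{g}_k^T\mathbf{Q}^{-1}\mathbf{g}_k)}\right\} E(\mathbf{x}_k). \tag{8.40}$$

Proof. The proof is by direct computation. We have, setting $\mathbf{y}_k = \mathbf{x}_k - \mathbf{x}^*$,

$$\frac{E(\mathbf{x}_k) - E(\mathbf{x}_{k+1})}{E(\mathbf{x}_k)} = \frac{2\alpha_k\mathbf{g}_k^T\mathbf{Q}\mathbf{y}_k - \alpha_k^2\mathbf{g}_k^T\mathbf{Q}\mathbf{g}_k}{\mathbf{y}_k^T\mathbf{Q}\mathbf{y}_k}.$$

Using $\mathbf{g}_k = \mathbf{Q}\mathbf{y}_k$ we have

$$\frac{E(\mathbf{x}_k) - E(\mathbf{x}_{k+1})}{E(\mathbf{x}_k)} = \frac{\dfrac{2(\mathbf{g}_k^T\mathbf{g}_k)^2}{\mathbf{g}_k^T\mathbf{Q}\mathbf{g}_k} - \dfrac{(\mathbf{g}_k^T\mathbf{g}_k)^2}{\mathbf{g}_k^T\mathbf{Q}\mathbf{g}_k}}{\mathbf{g}_k^T\mathbf{Q}^{-1}\mathbf{g}_k} = \frac{(\mathbf{g}_k^T\mathbf{g}_k)^2}{(\mathbf{g}_k^T\mathbf{Q}\mathbf{g}_k)(\mathbf{g}_k^T\mathbf{Q}^{-1}\mathbf{g}_k)}.$$ ∎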

In order to obtain a bound on the rate of convergence, we need a bound on the
right-hand side of (8.40). The best bound is due to Kantorovich and his lemma, stated
below, is a useful general tool in convergence analysis.

Kantorovich inequality: Let $\mathbf{Q}$ be a positive definite symmetric $n \times n$ matrix. For any vector
$\mathbf{x}$ there holds

$$\frac{(\mathbf{x}^T\mathbf{x})^2}{(\mathbf{x}^T\mathbf{Q}\mathbf{x})(\mathbf{x}^T\mathbf{Q}^{-1}\mathbf{x})} \ge \frac{4aA}{(a + A)^2}, \tag{8.41}$$

where $a$ and $A$ are, respectively, the smallest and largest eigenvalues of $\mathbf{Q}$.


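Proof. Let the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ of $\mathbf{Q}$ satisfy

$$0 < a = \lambda_1 \le \lambda_2 \le \ldots \le \lambda_n = A.$$

By an appropriate change of coordinates the matrix $\mathbf{Q}$ becomes diagonal with diagonal
$(\lambda_1, \lambda_2, \ldots, \lambda_n)$. In this coordinate system we have

$$\frac{(\mathbf{x}^T\mathbf{x})^2}{(\mathbf{x}^T\mathbf{Q}\mathbf{x})(\mathbf{x}^T\mathbf{Q}^{-1}\mathbf{x})} = \frac{\left(\sum_{i=1}^{n} x_i^2\right)^2}{\left(\sum_{i=1}^{n} \lambda_i x_i^2\right)\left(\sum_{i=1}^{n} x_i^2/\lambda_i\right)},$$

which can be written as

$$\frac{(\mathbf{x}^T\mathbf{x})^2}{(\mathbf{x}^T\mathbf{Q}\mathbf{x})(\mathbf{x}^T\mathbf{Q}^{-1}\mathbf{x})} = \frac{1/\sum_{i=1}^{n} \xi_i\lambda_i}{\sum_{i=1}^{n} (\xi_i/\lambda_i)} \equiv \frac{\phi(\xi)}{\psi(\xi)},$$

where $\xi_i = x_i^2/\sum_{j=1}^{n} x_j^2$. We have converted the expression to the ratio of two functions
involving convex combinations; one a combination of $\lambda_i$’s; the other a combination
of $1/\lambda_i$’s. The situation is shown pictorially in Fig. 8.10. The curve in the
figure represents the function $1/\lambda$. Since $\sum_{i=1}^{n} \xi_i\lambda_i$ is a point between $\lambda_1$ and $\lambda_n$, the
value of $\phi(\xi)$ is a point on the curve. On the other hand, the value of $\psi(\xi)$ is a convex
combination of points on the curve and its value corresponds to a point in the shaded
region. For the same vector $\xi$ both functions are represented by points on the same
vertical line. The minimum value of this ratio is achieved for some $\lambda = \xi_1\lambda_1 + \xi_n\lambda_n$,
with $\xi_1 + \xi_n = 1$. Using the relation $\xi_1/\lambda_1 + \xi_n/\lambda_n = (\lambda_1 + \lambda_n - \xi_1\lambda_1 - \xi_n\lambda_n)/\lambda_1\lambda_n$,
an appropriate bound is

$$\frac{\phi(\xi)}{\psi(\xi)} \ge \min_{\lambda_1 \le \lambda \le \lambda_n} \frac{(1/\lambda)}{(\lambda_1 + \lambda_n - \lambda)/(\lambda_1\lambda_n)}.$$

The minimum is achieved at $\lambda = (\lambda_1 + \lambda_n)/2$, yielding

$$\frac{\phi(\xi)}{\psi(\xi)} \ge \frac{4\lambda_1\lambda_n}{(\lambda_1 + \lambda_n)^2}.$$ ∎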
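
As a sanity check on (8.41), the ratio can be evaluated in the diagonal coordinate system used in the proof. The eigenvalues, the random sample, and the function names below are illustrative assumptions.

```python
import random

def kantorovich_ratio(eigs, x):
    """(x^T x)^2 / ((x^T Q x)(x^T Q^{-1} x)) for the diagonal matrix Q = diag(eigs)."""
    xx = sum(xi * xi for xi in x)
    xQx = sum(lam * xi * xi for lam, xi in zip(eigs, x))
    xQinvx = sum(xi * xi / lam for lam, xi in zip(eigs, x))
    return xx * xx / (xQx * xQinvx)

def kantorovich_bound(a, A):
    """Right-hand side of (8.41): 4aA / (a + A)^2."""
    return 4.0 * a * A / (a + A) ** 2

# Illustrative eigenvalues (an assumption), so a = 1 and A = 10.
eigs = [1.0, 2.0, 5.0, 10.0]
bound = kantorovich_bound(min(eigs), max(eigs))

# Equal weight on the two extreme eigenvectors attains the bound.
tight = kantorovich_ratio(eigs, [1.0, 0.0, 0.0, 1.0])

# Random vectors should never fall below the bound (up to rounding).
random.seed(0)
samples = [[random.uniform(-1.0, 1.0) for _ in eigs] for _ in range(1000)]
all_above = all(kantorovich_ratio(eigs, x) >= bound - 1e-12 for x in samples)
print(bound, tight, all_above)
```

The vector with equal weight on the smallest and largest eigenvalue corresponds to $\xi_1 = \xi_n = 1/2$, which is where the proof locates the minimum.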

Combining the above two lemmas, we obtain the central result on the convergence
of the method of steepest descent.

Theorem 2 (Steepest Descent, Quadratic Case). For any $\mathbf{x}_0 \in E^n$ the method of steepest
descent (8.39) converges to the unique minimum point $\mathbf{x}^*$ of $f$. Furthermore, with $E(\mathbf{x}) =
\tfrac{1}{2}(\mathbf{x} - \mathbf{x}^*)^T\mathbf{Q}(\mathbf{x} - \mathbf{x}^*)$, there holds at every step $k$

$$E(\mathbf{x}_{k+1}) \le \left(\frac{A - a}{A + a}\right)^2 E(\mathbf{x}_k). \tag{8.42}$$

Proof. By Lemma 2 and the Kantorovich inequality

$$E(\mathbf{x}_{k+1}) \le \left\{1 - \frac{4aA}{(A + a)^2}\right\} E(\mathbf{x}_k) = \left(\frac{A - a}{A + a}\right)^2 E(\mathbf{x}_k).$$

It follows immediately that $E(\mathbf{x}_k) \to 0$ and hence, since $\mathbf{Q}$ is positive definite, that
$\mathbf{x}_k \to \mathbf{x}^*$. ∎

Fig. 8.10 Kantorovich inequality

Roughly speaking, the above theorem says that the convergence rate of steepest
descent is slowed as the contours of $f$ become more eccentric. If $a = A$, corresponding
to circular contours, convergence occurs in a single step. Note, however, that
even if $n - 1$ of the $n$ eigenvalues are equal and the remaining one is a great distance
from these, convergence will be slow, and hence a single abnormal eigenvalue can
destroy the effectiveness of steepest descent.

In the terminology introduced in Sect. 7.8, the above theorem states that with
respect to the error function $E$ (or equivalently $f$) the method of steepest descent
converges linearly with a ratio no greater than $[(A - a)/(A + a)]^2$. The actual rate
depends on the initial point $\mathbf{x}_0$. However, for some initial points the bound is actually
achieved. Furthermore, it has been shown by Akaike that, if the ratio is unfavorable,
the process is very likely to converge at a rate close to the bound. Thus, somewhat
loosely but with reasonable justification, we say that the convergence ratio of steepest
descent is $[(A - a)/(A + a)]^2$.

It should be noted that the convergence rate actually depends only on the ratio
$r = A/a$ of the largest to the smallest eigenvalue. Thus the convergence ratio is

$$\left(\frac{A - a}{A + a}\right)^2 = \left(\frac{r - 1}{r + 1}\right)^2,$$

which clearly shows that convergence is slowed as $r$ increases. The ratio $r$, which
is the single number associated with the matrix $\mathbf{Q}$ that characterizes convergence, is
often called the condition number of the matrix.

Example.  Let  us  take 


0.78  -0.02  -0.12  -0.14 
_  -0.02  0.86  -0.04  0.06 
-0.12  -0.04  0.72  -0.08 
-0.14  0.06  -0.08  0.74  _ 

b  =  (0.76,  0.08,  1.12,  0.68). 

For this matrix it can be calculated that $a = 0.52$, $A = 0.94$ and hence $r = 1.8$.
This is a very favorable condition number and leads to the convergence ratio
$[(A - a)/(A + a)]^2 = 0.081$. Thus each iteration will reduce the error in the objective by
more than a factor of ten; or, equivalently, each iteration will add about one more
digit of accuracy. Indeed, starting from the origin the sequence of values obtained
by steepest descent as shown in Table 8.1 is consistent with this estimate.
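The example can be checked numerically. The following is a minimal sketch, assuming the quadratic objective $f(x) = \tfrac{1}{2}x^TQx - b^Tx$ of this section and an exact line search, for which the step along $-g$ is $g^Tg/(g^TQg)$. The helper names (`matvec`, `dot`, `steepest_descent`) are illustrative and not from the text.

```python
# Steepest descent with exact line search on f(x) = 0.5 x^T Q x - b^T x.
Q = [[0.78, -0.02, -0.12, -0.14],
     [-0.02, 0.86, -0.04, 0.06],
     [-0.12, -0.04, 0.72, -0.08],
     [-0.14, 0.06, -0.08, 0.74]]
b = [0.76, 0.08, 1.12, 0.68]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def f(x):
    return 0.5 * dot(x, matvec(Q, x)) - dot(b, x)


def steepest_descent(x, steps):
    """Return the final point and the objective value after each step."""
    values = [f(x)]
    for _ in range(steps):
        g = [qx - bi for qx, bi in zip(matvec(Q, x), b)]  # gradient Qx - b
        gg = dot(g, g)
        if gg == 0.0:
            break
        alpha = gg / dot(g, matvec(Q, g))  # exact minimizer of f along -g
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
        values.append(f(x))
    return x, values


x_final, f_values = steepest_descent([0.0] * 4, 6)
for k, value in enumerate(f_values):
    print(k, f"{value:.7f}")
```

The printed values can be compared line by line with Table 8.1; small differences in the last digit may arise from the precision used for the original table.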


The  Nonquadratic  Case 

For nonquadratic functions, we expect that steepest descent will also do reasonably
well if the condition number is modest. Fortunately, we are able to establish
estimates of the progress of the method when the Hessian matrix is always positive
definite. Specifically, we assume that the Hessian matrix is bounded above and
below as $aI \le F(x) \le AI$. (Thus $f$ is strongly convex.) We present three analyses:


Table 8.1 Solution to Example

Step k    f(x_k)
0          0
1         -2.1563625
2         -2.1744062
3         -2.1746440
4         -2.1746585
5         -2.1746595
6         -2.1746595

Solution point x* = (1.534965, 0.1220097, 1.975156, 1.412954)


1. Exact Line Search. Given a point $x_k$, we have for any $\alpha$

$$f(x_k - \alpha g(x_k)) \le f(x_k) - \alpha\, g(x_k)^T g(x_k) + \frac{A\alpha^2}{2}\, g(x_k)^T g(x_k). \tag{8.43}$$
Minimizing both sides separately with respect to $\alpha$, the inequality will hold for the
two minima. The minimum of the left-hand side is $f(x_{k+1})$. The minimum of the
right-hand side occurs at $\alpha = 1/A$, yielding the result

$$f(x_{k+1}) \le f(x_k) - \frac{1}{2A}|g(x_k)|^2,$$

where $|g(x_k)|^2 \equiv g(x_k)^T g(x_k)$. Subtracting the optimal value $f^* = f(x^*)$ from both
sides produces

$$f(x_{k+1}) - f^* \le f(x_k) - f^* - \frac{1}{2A}|g(x_k)|^2. \tag{8.44}$$

In a similar way, for any $x$ there holds

$$f(x) \ge f(x_k) + g(x_k)^T(x - x_k) + \frac{a}{2}|x - x_k|^2.$$

Again we can minimize both sides separately. The minimum of the left-hand side is
$f^*$, the optimal solution value. Minimizing the right-hand side leads to a quadratic
optimization problem whose solution is $x = x_k - g(x_k)/a$. Substituting this $x$ in the
right-hand side of the inequality gives

$$f^* \ge f(x_k) - \frac{1}{2a}|g(x_k)|^2. \tag{8.45}$$

From (8.45) we have

$$-|g(x_k)|^2 \le 2a[f^* - f(x_k)]. \tag{8.46}$$

Substituting this in (8.44) gives

$$f(x_{k+1}) - f^* \le (1 - a/A)[f(x_k) - f^*]. \tag{8.47}$$


This  shows  that  the  method  of  steepest  descent  makes  progress  even  when  it  is  not 
close  to  the  solution. 


2. Other Stopping Criteria. As an example of how other stopping criteria can
be treated, we examine the rate of convergence when using Armijo’s rule with
$\varepsilon < 0.5$ and $\eta > 1$. Note first that the inequality $t \ge t^2$ for $0 \le t \le 1$ implies by a
change of variable that

$$-\alpha + \frac{\alpha^2 A}{2} \le -\frac{\alpha}{2}$$

for $0 \le \alpha \le 1/A$. Then using (8.43) we have that for $\alpha \le 1/A$

$$\begin{aligned} f(x_k - \alpha g(x_k)) &\le f(x_k) - \alpha|g(x_k)|^2 + 0.5\alpha^2 A|g(x_k)|^2 \\ &\le f(x_k) - 0.5\alpha|g(x_k)|^2 \\ &< f(x_k) - \varepsilon\alpha|g(x_k)|^2 \end{aligned}$$

since $\varepsilon < 0.5$. This means that the first part of the stopping criterion is satisfied
for $\alpha \le 1/A$.
The second part of the stopping criterion states that $\eta\alpha$ does not satisfy the first
criterion and thus the final $\alpha$ must satisfy $\alpha > 1/(\eta A)$. Therefore the inequality of
the first part of the criterion implies

$$f(x_{k+1}) \le f(x_k) - \frac{\varepsilon}{\eta A}|g(x_k)|^2.$$

Subtracting $f^*$ from both sides,

$$f(x_{k+1}) - f^* \le f(x_k) - f^* - \frac{\varepsilon}{\eta A}|g(x_k)|^2.$$

Finally, using (8.46) we obtain

$$f(x_{k+1}) - f^* \le [1 - (2\varepsilon a/\eta A)]\,(f(x_k) - f^*).$$

Clearly $2\varepsilon a/\eta A < 1$ and hence there is linear convergence. Notice that if in fact $\varepsilon$ is
chosen very close to 0.5 and $\eta$ is chosen very close to 1, then the stopping condition
demands that $\alpha$ be restricted to a very small range, and the estimated rate of
convergence is very close to the estimate obtained above for exact line search.
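The bound just derived can be observed on a small problem. The sketch below is illustrative only: it assumes a hypothetical diagonal quadratic with $a = 1$, $A = 10$ and $f^* = 0$, and one simple way of finding a step that satisfies both parts of Armijo’s rule (grow or shrink a trial step by the factor $\eta$). The names `armijo_step`, `eps` and `eta` are introduced here for illustration.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def armijo_step(phi, slope0, alpha=1.0, eps=0.3, eta=2.0):
    """Pick alpha so that alpha passes the sufficient-decrease test
    phi(alpha) <= phi(0) + eps*slope0*alpha while eta*alpha fails it."""
    phi0 = phi(0.0)

    def accepted(step):
        return phi(step) <= phi0 + eps * slope0 * step

    if accepted(alpha):
        while accepted(eta * alpha):   # grow until eta*alpha is rejected
            alpha *= eta
    else:
        while not accepted(alpha):     # shrink until alpha is accepted
            alpha /= eta
    return alpha


# Hypothetical strongly convex test problem: f(x) = 0.5 * sum(q_i x_i^2),
# whose Hessian has smallest eigenvalue a = 1 and largest A = 10; f* = 0.
q = [1.0, 4.0, 10.0]
a, A = min(q), max(q)
eps, eta = 0.3, 2.0


def f(x):
    return 0.5 * sum(qi * xi * xi for qi, xi in zip(q, x))


def grad(x):
    return [qi * xi for qi, xi in zip(q, x)]


x = [1.0, 1.0, 1.0]
ratios = []
for _ in range(25):
    g = grad(x)
    if dot(g, g) == 0.0:
        break
    phi = lambda t, x=x, g=g: f([xi - t * gi for xi, gi in zip(x, g)])
    alpha = armijo_step(phi, -dot(g, g), eps=eps, eta=eta)
    x_new = [xi - alpha * gi for xi, gi in zip(x, g)]
    ratios.append(f(x_new) / f(x))     # error ratio, since f* = 0
    x = x_new

bound = 1.0 - 2.0 * eps * a / (eta * A)
print(max(ratios), "<=", bound)
```

In runs of this kind the observed ratios are usually far below the bound, which is a worst-case estimate.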

3.  Asymptotic  Convergence.  We  expect  that  as  the  points  generated  by  steepest 
descent  approach  the  solution  point,  the  convergence  characteristics  will  be  close 
to  those  inherent  for  quadratic  functions.  This  is  indeed  the  case. 

The  general  procedure  for  proving  such  a  result,  which  is  applicable  to  most 
methods  having  unity  order  of  convergence,  is  to  use  the  Hessian  of  the  objective  at 
the  solution  point  as  if  it  were  the  Q  matrix  of  a  quadratic  problem.  The  particular 
theorem  stated  below  is  a  special  case  of  a  theorem  in  Sect.  12.5  so  we  do  not  prove 
it  here;  but  it  illustrates  the  generalizability  of  an  analysis  of  quadratic  problems. 

Theorem. Suppose $f$ is defined on $E^n$, has continuous second partial derivatives, and has
a relative minimum at $x^*$. Suppose further that the Hessian matrix of $f$, $F(x^*)$, has smallest
eigenvalue $a > 0$ and largest eigenvalue $A > 0$. If $\{x_k\}$ is a sequence generated by the
method of steepest descent that converges to $x^*$, then the sequence of objective values $\{f(x_k)\}$
converges to $f(x^*)$ linearly with a convergence ratio no greater than $[(A - a)/(A + a)]^2$.


8.3  Applications  of  the  Convergence  Theory 

Now that the basic convergence theory, as represented by the formula (8.42) for
the rate of convergence, has been developed and demonstrated to actually characterize
the behavior of steepest descent, it is appropriate to illustrate how the theory
can be used. Generally, we do not suggest that one compute the numerical value
of the formula, since it involves eigenvalues, or ratios of eigenvalues, that are not
easily determined. Nevertheless, the formula itself is of immense practical importance,
since it allows one to theoretically compare various situations. Without such
a theory, one would be forced to rely completely on experimental comparisons.
Application 1 (Solution of Gradient Equation). One approach to the minimization
of a function $f$ is to consider solving the equations $\nabla f(x) = 0$ that represent the
necessary conditions. It has been proposed that these equations could be solved by
applying steepest descent to the function $h(x) = |\nabla f(x)|^2$. One advantage of this
method is that the minimum value is known. We ask whether this method is likely
to be faster or slower than the application of steepest descent to the original function
$f$ itself.

For simplicity we consider only the case where $f$ is quadratic. Thus let $f(x) =
(1/2)x^TQx - b^Tx$. Then the gradient of $f$ is $g(x) = Qx - b$, and $h(x) = |g(x)|^2 =
x^TQ^2x - 2x^TQb + b^Tb$. Thus $h(x)$ is itself a quadratic function. The rate of convergence
of steepest descent applied to $h$ will be governed by the eigenvalues of the
matrix $Q^2$. In particular the rate will be

$$\left(\frac{\bar{r} - 1}{\bar{r} + 1}\right)^2,$$

where $\bar{r}$ is the condition number of the matrix $Q^2$. However, the eigenvalues of $Q^2$
are the squares of those of $Q$ itself, so $\bar{r} = r^2$, where $r$ is the condition number of $Q$,
and it is clear that the convergence rate for the proposed method will be worse than
for steepest descent applied to the original function.

We can go further and actually estimate how much slower the proposed method
is likely to be. If $r$ is large, we have

$$\text{steepest descent rate} = \left(\frac{r - 1}{r + 1}\right)^2 \simeq (1 - 1/r)^4,$$

$$\text{proposed method rate} = \left(\frac{r^2 - 1}{r^2 + 1}\right)^2 \simeq (1 - 1/r^2)^4.$$

Since $(1 - 1/r^2)^r \simeq 1 - 1/r$, it follows that it takes about $r$ steps of the new method to
equal one step of ordinary steepest descent. We conclude that if the original problem
is difficult to solve with steepest descent, the proposed method will be quite a bit
worse.
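The estimate of about $r$ steps can be checked directly from the two rate formulas. The following sketch makes no assumption beyond those formulas; the function names are illustrative. It computes how many steps at the proposed-method rate give the same error reduction as one step at the ordinary rate.

```python
import math


def convergence_ratio(cond):
    """Steepest-descent ratio ((r - 1)/(r + 1))^2 for condition number r."""
    return ((cond - 1.0) / (cond + 1.0)) ** 2


def equivalent_steps(r):
    """Steps on h = |grad f|^2 (condition number r^2) needed to match
    the error reduction of one step on f (condition number r)."""
    return math.log(convergence_ratio(r)) / math.log(convergence_ratio(r * r))


for r in (10.0, 100.0, 1000.0):
    print(r, convergence_ratio(r), convergence_ratio(r * r), equivalent_steps(r))
```

For each listed $r$ the last printed column comes out close to $r$ itself, in agreement with the approximation above.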

Application 2 (Penalty Methods). Let us briefly consider a problem with a single
constraint:

$$\begin{aligned} &\text{minimize} && f(x) \\ &\text{subject to} && h(x) = 0. \end{aligned} \tag{8.48}$$

One method for approaching this problem is to convert it (at least approximately) to
the unconstrained problem

$$\text{minimize} \quad f(x) + \tfrac{1}{2}\mu\, h(x)^2, \tag{8.49}$$
where $\mu$ is a (large) penalty coefficient. Because of the penalty, the solution to (8.49)
will tend to have a small $h(x)$. Problem (8.49) can be solved as an unconstrained
problem by the method of steepest descent. How will this behave?

For simplicity let us consider the case where $f$ is quadratic and $h$ is linear.
Specifically, we consider the problem

$$\begin{aligned} &\text{minimize} && \tfrac{1}{2}x^TQx - b^Tx \\ &\text{subject to} && c^Tx = 0. \end{aligned} \tag{8.50}$$


The objective of the associated penalty problem is $(1/2)\{x^TQx + \mu x^Tcc^Tx\} - b^Tx$. The
quadratic form associated with this objective is defined by the matrix $Q + \mu cc^T$ and,
accordingly, the convergence rate of steepest descent will be governed by the condition
number of this matrix. This matrix is the original matrix $Q$ with a large rank-one
matrix added. It should be fairly clear¹ that this addition will cause one eigenvalue
of the matrix to be large (on the order of $\mu$). Thus the condition number is roughly
proportional to $\mu$. Therefore, as one increases $\mu$ in order to get an accurate solution
to the original constrained problem, the rate of convergence becomes extremely
poor. We conclude that the penalty function method used in this simplistic way with
steepest descent will not be very effective. (Penalty functions, and how to minimize
them more rapidly, are considered in detail in Chap. 11.)
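The growth of the condition number with $\mu$ is easy to see in two dimensions, where the eigenvalues of a symmetric matrix have a closed form. The sketch below uses hypothetical data, $Q = \mathrm{diag}(2, 1)$ and $c = (1, 1)$, chosen only for illustration.

```python
import math


def sym2_eigenvalues(m11, m12, m22):
    """Eigenvalues (small, large) of the symmetric matrix [[m11, m12], [m12, m22]]."""
    mean = 0.5 * (m11 + m22)
    radius = math.hypot(0.5 * (m11 - m22), m12)
    return mean - radius, mean + radius


def penalty_condition_number(mu, q=(2.0, 1.0), c=(1.0, 1.0)):
    """Condition number of Q + mu*c*c^T with Q = diag(q) (illustrative data)."""
    low, high = sym2_eigenvalues(q[0] + mu * c[0] * c[0],
                                 mu * c[0] * c[1],
                                 q[1] + mu * c[1] * c[1])
    return high / low


for mu in (1.0, 10.0, 100.0, 1000.0):
    print(mu, penalty_condition_number(mu))
```

Each tenfold increase in $\mu$ multiplies the condition number by nearly ten once $\mu$ is large, while the smaller eigenvalue stays between the eigenvalues of $Q$.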


Scaling 

The  performance  of  the  method  of  steepest  descent  is  dependent  on  the  particular 
choice  of  variables  x  used  to  define  the  problem.  A  new  choice  may  substantially 
alter  the  convergence  characteristics. 

Suppose that $T$ is an invertible $n \times n$ matrix. We can then represent points in $E^n$
either by the standard vector $x$ or by $y$ where $Ty = x$. The problem of finding $x$ to
minimize $f(x)$ is equivalent to that of finding $y$ to minimize $h(y) = f(Ty)$. Using $y$
as the underlying set of variables, we then have

$$\nabla h = \nabla f\, T, \tag{8.51}$$

where $\nabla f$ is the gradient of $f$ with respect to $x$. Thus, using steepest descent, the
direction of search will be

$$\Delta y = -T^T \nabla f^T, \tag{8.52}$$

which in the original variables is

$$\Delta x = -TT^T \nabla f^T. \tag{8.53}$$


¹ See the Interlocking Eigenvalues Lemma in Sect. 10.6 for a proof that only one eigenvalue
becomes large.
Thus we see that the change of variables changes the direction of search.

The rate of convergence of steepest descent with respect to $y$ will be determined
by the eigenvalues of the Hessian of the objective, taken with respect to $y$. That
Hessian is

$$\nabla^2 h(y) \equiv H(y) = T^T F(Ty)\, T.$$

Thus, if $x^* = Ty^*$ is the solution point, the rate of convergence is governed by the
matrix

$$H(y^*) = T^T F(x^*)\, T. \tag{8.54}$$

Very little can be said in comparison of the convergence ratio associated with $H$
and that of $F$. If $T$ is an orthonormal matrix, corresponding to $y$ being defined from
$x$ by a simple rotation of coordinates, then $TT^T = I$, and we see from (8.53) that the
directions remain unchanged and the eigenvalues of $H$ are the same as those of $F$.

In  general,  before  attacking  a  problem  with  steepest  descent,  it  is  desirable,  if  it  is 
feasible,  to  introduce  a  change  of  variables  that  leads  to  a  more  favorable  eigenvalue 
structure.  Usually  the  only  kind  of  transformation  that  is  at  all  practical  is  one  having 
T  equal  to  a  diagonal  matrix,  corresponding  to  the  introduction  of  scale  factors  on 
each  of  the  variables.  One  should  strive,  in  doing  this,  to  make  the  second  derivatives 
with  respect  to  each  variable  roughly  the  same.  Although  appropriate  scaling  can 
potentially  lead  to  substantial  payoff  in  terms  of  enhanced  convergence  rate,  we 
largely  ignore  this  possibility  in  our  discussions  of  steepest  descent.  However,  see 
the  next  application  for  a  situation  that  frequently  occurs. 

Application 3 (Program Design). In applied work it is extremely rare that one
solves just a single optimization problem of a given type. It is far more usual that
once a problem is coded for computer solution, it will be solved repeatedly for
various parameter values. Thus, for example, if one is seeking to find the optimal
production plan (as in Example 2 of Sect. 7.2), the problem will be solved for the
different values of the input prices. Similarly, other optimization problems will be
solved under various assumptions and constraint values. It is for this reason that
speed of convergence and convergence analysis are so important. One wants a program
that can be used efficiently. In many such situations, the effort devoted to
proper scaling repays itself, not with the first execution, but in the long run.

As a simple illustration consider the problem of minimizing the function

$$f(x, y) = x^2 - 5xy + y^4 - ax - by.$$

It is desirable to obtain solutions quickly for different values of the parameters $a$
and $b$. We begin with the values $a = 25$, $b = 8$.

The result of steepest descent applied to this problem directly is shown in
Table 8.2, column (a). It requires eighty iterations for convergence, which could be
regarded as disappointing.

The reason for this poor performance is revealed by examining the Hessian
matrix

$$F = \begin{bmatrix} 2 & -5 \\ -5 & 12y^2 \end{bmatrix}.$$

Using the results of our first experiment, we know that $y = 3$. Hence the diagonal
elements of the Hessian, at the solution, differ by a factor of 54. (In fact, the condition
number is about 61.) As a simple remedy we scale the problem by replacing the
variable $y$ by $z = ty$. The new lower right-corner term of the Hessian then becomes
$12z^2/t^4$, which has magnitude $12 \times t^2 \times 3^2/t^4 = 108/t^2$. Thus we might put $t = 7$ in
order to make the two diagonal terms approximately equal. The result of applying
steepest descent to the problem scaled this way is shown in Table 8.2, column (b).
(This superior performance is in accordance with our general theory, since the condition
number of the scaled problem is about two.) For other nearby values of $a$ and
$b$, similar speeds will be attained.
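The two condition numbers quoted above can be reproduced with a few lines of arithmetic. This sketch assumes the solution point $x = 20$, $y = 3$ and the scaling $z = ty$ with $t = 7$, so that $T = \mathrm{diag}(1, 1/t)$ in the notation of (8.54). The function names are illustrative.

```python
import math


def gradient(x, y, a=25.0, b=8.0):
    """Gradient of f(x, y) = x^2 - 5xy + y^4 - a x - b y."""
    return 2.0 * x - 5.0 * y - a, -5.0 * x + 4.0 * y ** 3 - b


def hessian(x, y):
    """Entries (F11, F12, F22) of the symmetric Hessian of f."""
    return 2.0, -5.0, 12.0 * y * y


def scaled_hessian(x, y, t):
    """Hessian T^T F T in the variables (x, z) with z = t*y, T = diag(1, 1/t)."""
    f11, f12, f22 = hessian(x, y)
    return f11, f12 / t, f22 / (t * t)


def condition_number(m11, m12, m22):
    """Ratio of largest to smallest eigenvalue of [[m11, m12], [m12, m22]]."""
    mean = 0.5 * (m11 + m22)
    radius = math.hypot(0.5 * (m11 - m22), m12)
    return (mean + radius) / (mean - radius)


print(gradient(20.0, 3.0))                                  # stationarity check
print(condition_number(*hessian(20.0, 3.0)))                # unscaled
print(condition_number(*scaled_hessian(20.0, 3.0, 7.0)))    # scaled with t = 7
```

The same functions can be reused to try other values of $t$ when the parameters $a$ and $b$ change the solution point.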


Table 8.2 Solution to scaling application

                      Value of f
Iteration no.    (a) Unscaled    (b) Scaled
 0                   0.0000          0.0000
 1                -230.9958       -162.2000
 2                -256.4042       -289.3124
 4                -293.1705       -341.9802
 6                -313.3619       -342.9865
 8                -324.9978       -342.9998
 9                -329.0408       -343.0000
15                -339.6124
20                -341.9022
25                -342.6004
30                -342.8372
35                -342.9275
40                -342.9650
45                -342.9825
50                -342.9909
55                -342.9951
60                -342.9971
65                -342.9883
70                -342.9990
75                -342.9994
80                -342.9997

Solution: x = 20.0, y = 3.0

8.4  Accelerated  Steepest  Descent 


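There is an accelerated steepest descent method that works as follows:

$$\lambda^0 = 0, \qquad \lambda^{k+1} = \frac{1 + \sqrt{1 + 4(\lambda^k)^2}}{2}, \qquad \alpha^k = \frac{1 - \lambda^k}{\lambda^{k+1}}, \tag{8.55}$$

$$\tilde{x}_{k+1} = x_k - \frac{1}{\beta}\nabla f(x_k)^T, \qquad x_{k+1} = (1 - \alpha^k)\tilde{x}_{k+1} + \alpha^k \tilde{x}_k. \tag{8.56}$$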
Note that $(\lambda^k)^2 = \lambda^{k+1}(\lambda^{k+1} - 1)$, $\lambda^k \ge k/2$ and, for $k \ge 1$, $\alpha^k \le 0$. One can prove:


Theorem (Accelerated Steepest Descent). Let $f(x)$ be convex and differentiable everywhere,
satisfy the (first-order) $\beta$-Lipschitz condition, and admit a minimizer $x^*$. Then
the method of accelerated steepest descent generates a sequence of solutions such that

$$f(\tilde{x}_{k+1}) - f(x^*) \le \frac{2\beta}{k^2}|x_0 - x^*|^2, \qquad \forall k \ge 1.$$

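Proof. We now let $d_k = \lambda^k x_k - (\lambda^k - 1)\tilde{x}_k - x^*$, and $\delta_k = f(\tilde{x}_k) - f(x^*)\ (\ge 0)$.
Applying Lemma 1 for $x = \tilde{x}_{k+1}$ and $y = \tilde{x}_k$, convexity of $f$ and (8.56), we have

$$\begin{aligned} \delta_{k+1} - \delta_k &= f(\tilde{x}_{k+1}) - f(x_k) + f(x_k) - f(\tilde{x}_k) \\ &\le -\frac{\beta}{2}|\tilde{x}_{k+1} - x_k|^2 + f(x_k) - f(\tilde{x}_k) \\ &\le -\frac{\beta}{2}|\tilde{x}_{k+1} - x_k|^2 + (g_k)^T(x_k - \tilde{x}_k) \\ &= -\frac{\beta}{2}|\tilde{x}_{k+1} - x_k|^2 - \beta(\tilde{x}_{k+1} - x_k)^T(x_k - \tilde{x}_k), \end{aligned} \tag{8.57}$$

where the last line uses $g_k = \nabla f(x_k)^T = -\beta(\tilde{x}_{k+1} - x_k)$ from (8.56).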


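Applying Lemma 1 for $x = \tilde{x}_{k+1}$ and $y = x^*$, convexity of $f$ and (8.56), we have

$$\begin{aligned} \delta_{k+1} &= f(\tilde{x}_{k+1}) - f(x_k) + f(x_k) - f(x^*) \\ &\le -\frac{\beta}{2}|\tilde{x}_{k+1} - x_k|^2 + f(x_k) - f(x^*) \\ &\le -\frac{\beta}{2}|\tilde{x}_{k+1} - x_k|^2 + (g_k)^T(x_k - x^*) \\ &= -\frac{\beta}{2}|\tilde{x}_{k+1} - x_k|^2 - \beta(\tilde{x}_{k+1} - x_k)^T(x_k - x^*). \end{aligned} \tag{8.58}$$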


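Multiplying (8.57) by $\lambda^k(\lambda^k - 1)$ and (8.58) by $\lambda^k$ respectively, and summing the
two, we have

$$\begin{aligned} (\lambda^k)^2\delta_{k+1} - (\lambda^{k-1})^2\delta_k &\le -(\lambda^k)^2\frac{\beta}{2}|\tilde{x}_{k+1} - x_k|^2 - \lambda^k\beta(\tilde{x}_{k+1} - x_k)^T d_k \\ &= -\frac{\beta}{2}\left((\lambda^k)^2|\tilde{x}_{k+1} - x_k|^2 + 2\lambda^k(\tilde{x}_{k+1} - x_k)^T d_k\right) \\ &= -\frac{\beta}{2}\left(|\lambda^k\tilde{x}_{k+1} - (\lambda^k - 1)\tilde{x}_k - x^*|^2 - |d_k|^2\right) \\ &= \frac{\beta}{2}\left(|d_k|^2 - |\lambda^k\tilde{x}_{k+1} - (\lambda^k - 1)\tilde{x}_k - x^*|^2\right). \end{aligned}$$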


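Using (8.55) and (8.56) we derive

$$\lambda^k\tilde{x}_{k+1} - (\lambda^k - 1)\tilde{x}_k = \lambda^{k+1}x_{k+1} - (\lambda^{k+1} - 1)\tilde{x}_{k+1}.$$

Thus,

$$(\lambda^k)^2\delta_{k+1} - (\lambda^{k-1})^2\delta_k \le \frac{\beta}{2}\left(|d_k|^2 - |d_{k+1}|^2\right). \tag{8.59}$$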


Summing up (8.59) from 1 to $k$ we have

$$\delta_{k+1} \le \frac{\beta}{2(\lambda^k)^2}|d_1|^2 \le \frac{2\beta}{k^2}|d_0|^2,$$

where we used the facts $\lambda^k \ge k/2$ and $|d_1| \le |d_0|$. ∎
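The recursion (8.55)–(8.56) and the bound of the theorem can be exercised on a small convex quadratic. The sketch below is illustrative only: it assumes $\tilde{x}_0 = x_0$, a hypothetical objective $f(x) = \tfrac{1}{2}\sum_i q_i x_i^2$ with minimizer $x^* = 0$, and the Lipschitz constant $\beta = \max_i q_i$. With these conventions the step $k = 0$ simply returns $x_1 = x_0$, so the acceleration begins at $k = 1$.

```python
import math


def accelerated_descent(grad, x0, beta, steps):
    """Run (8.55)-(8.56); entry k of the returned list is x-tilde_{k+1}."""
    lam = 0.0                                  # lambda^0
    x = list(x0)
    x_tilde = list(x0)                         # assumption: x-tilde_0 = x_0
    history = []
    for _ in range(steps):
        lam_next = (1.0 + math.sqrt(1.0 + 4.0 * lam * lam)) / 2.0
        alpha = (1.0 - lam) / lam_next
        g = grad(x)
        x_tilde_next = [xi - gi / beta for xi, gi in zip(x, g)]
        x = [(1.0 - alpha) * new + alpha * old
             for new, old in zip(x_tilde_next, x_tilde)]
        x_tilde, lam = x_tilde_next, lam_next
        history.append(x_tilde)
    return history


# Hypothetical test problem with minimizer x* = 0 and f* = 0.
q = [1.0, 10.0, 100.0]
beta = max(q)


def f(x):
    return 0.5 * sum(qi * xi * xi for qi, xi in zip(q, x))


def grad(x):
    return [qi * xi for qi, xi in zip(q, x)]


x0 = [1.0, 1.0, 1.0]
history = accelerated_descent(grad, x0, beta, 60)
dist2 = sum(xi * xi for xi in x0)              # |x_0 - x*|^2
for k in (1, 10, 59):
    print(k, f(history[k]), 2.0 * beta * dist2 / k ** 2)
```

Each printed pair shows the objective error at $\tilde{x}_{k+1}$ next to the theoretical bound $2\beta|x_0 - x^*|^2/k^2$.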
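The Method of False Position

There is yet another steepest descent method, commonly called the BB method, that
works as follows:

$$\Delta_x^k = x_k - x_{k-1} \quad \text{and} \quad \Delta_g^k = \nabla f(x_k) - \nabla f(x_{k-1}), \tag{8.60}$$

$$\alpha^k = \frac{(\Delta_x^k)^T \Delta_g^k}{(\Delta_g^k)^T \Delta_g^k} \quad \text{or} \quad \alpha^k = \frac{(\Delta_x^k)^T \Delta_x^k}{(\Delta_x^k)^T \Delta_g^k}.$$

Then

$$x_{k+1} = x_k - \alpha^k g_k = x_k - \alpha^k \nabla f(x_k)^T. \tag{8.61}$$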


The step size of the BB method resembles the one used in quadratic curve fitting
discussed for line search. There, the step size of (8.10) is given as

$$\frac{x_{k-1} - x_k}{f'(x_{k-1}) - f'(x_k)}.$$

If we let $\delta x_k = x_k - x_{k-1}$ and $\delta g_k = f'(x_k) - f'(x_{k-1})$, this quantity can be written as

$$\frac{\delta x_k\, \delta g_k}{(\delta g_k)^2} \quad \text{or} \quad \frac{(\delta x_k)^2}{\delta x_k\, \delta g_k}.$$

In the vector case, multiplication is replaced by inner product.

There is another explanation of the step size of the BB method. Consider convex
quadratic minimization, and let the distinct positive eigenvalues of the Hessian
$Q$ be $\lambda_1, \lambda_2, \ldots, \lambda_K$. Then, if we let the step size in the method of steepest descent be
$\alpha^k = 1/\lambda_k$, $k = 1, \ldots, K$, the method terminates in $K$ iterations (which we leave as an
exercise). In the BB method, $\alpha^k$ minimizes

$$|\Delta_x^k - \alpha\Delta_g^k| = |\Delta_x^k - \alpha Q\Delta_x^k|.$$

If the error becomes 0 while $|\Delta_x^k| \ne 0$, then $1/\alpha^k$ will be a positive eigenvalue of $Q$. Notice that
the objective values of the iterates generated by the BB method are not monotonically
decreasing; the method may overshoot in order to have a better position in the long
run.
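A short experiment illustrates the BB iteration (8.60)–(8.61). The sketch below is illustrative only: it uses the second step-size formula, a hypothetical diagonal quadratic whose minimizer is known, and an assumed initial step `alpha0` for the first iteration, since the BB formulas need two points before they can be applied.

```python
import math


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def bb_descent(grad, x0, alpha0, tol=1e-10, max_iter=500):
    """BB iteration; returns the final point and the number of steps taken."""
    x = list(x0)
    g = grad(x)
    alpha = alpha0                               # assumed first step size
    for iteration in range(max_iter):
        if math.sqrt(dot(g, g)) < tol:
            return x, iteration
        x_new = [xi - alpha * gi for xi, gi in zip(x, g)]
        g_new = grad(x_new)
        dx = [p - q for p, q in zip(x_new, x)]   # Delta_x
        dg = [p - q for p, q in zip(g_new, g)]   # Delta_g
        curvature = dot(dx, dg)
        if curvature > 0.0:
            alpha = dot(dx, dx) / curvature      # second formula for alpha^k
        x, g = x_new, g_new
    return x, max_iter


# Hypothetical problem: minimize 0.5*sum(q_i x_i^2) - sum(b_i x_i).
q = [1.0, 4.0, 10.0]
b = [1.0, 2.0, 3.0]


def grad(x):
    return [qi * xi - bi for qi, xi, bi in zip(q, x, b)]


x_bb, steps_used = bb_descent(grad, [0.0, 0.0, 0.0], alpha0=1.0 / max(q))
print(steps_used, x_bb)
```

Recording the objective value inside the loop would show the occasional increases mentioned above.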


8.5  Newton’s  Method 

The idea behind Newton’s method is that the function $f$ being minimized is approximated
locally by a quadratic function, and this approximate function is minimized
exactly. Thus near $x_k$ we can approximate $f$ by the truncated Taylor series

$$f(x) \simeq f(x_k) + \nabla f(x_k)(x - x_k) + \tfrac{1}{2}(x - x_k)^T F(x_k)(x - x_k).$$

The right-hand side is minimized at

$$x_{k+1} = x_k - [F(x_k)]^{-1}\nabla f(x_k)^T, \tag{8.62}$$

and this equation is the pure form of Newton’s method.

In view of the second-order sufficiency conditions for a minimum point, we assume
that at a relative minimum point, $x^*$, the Hessian matrix, $F(x^*)$, is positive
definite. We can then argue that if $f$ has continuous second partial derivatives, $F(x)$
is positive definite near $x^*$ and hence the method is well defined near the solution.


Order  Two  Convergence 

Newton’s  method  has  very  desirable  properties  if  started  sufficiently  close  to  the 
solution  point.  Its  order  of  convergence  is  two. 

Theorem (Newton’s Method). Let $f \in C^3$ on $E^n$, and assume that at the local minimum
point $x^*$, the Hessian $F(x^*)$ is positive definite. Then if started sufficiently close to $x^*$, the
points generated by Newton’s method converge to $x^*$. The order of convergence is at least
two.

Proof. There are $\rho > 0$, $\beta_1 > 0$, $\beta_2 > 0$ such that for all $x$ with $|x - x^*| < \rho$, there
holds $|F(x)^{-1}| < \beta_1$ (see Appendix A for the definition of the norm of a matrix) and
$|\nabla f(x^*)^T - \nabla f(x)^T - F(x)(x^* - x)| \le \beta_2|x - x^*|^2$. Now suppose $x_k$ is selected with
$\beta_1\beta_2|x_k - x^*| < 1$ and $|x_k - x^*| < \rho$. Then

$$\begin{aligned} |x_{k+1} - x^*| &= |x_k - x^* - F(x_k)^{-1}\nabla f(x_k)^T| \\ &= |F(x_k)^{-1}[\nabla f(x^*)^T - \nabla f(x_k)^T - F(x_k)(x^* - x_k)]| \\ &\le |F(x_k)^{-1}|\,\beta_2|x_k - x^*|^2 \\ &\le \beta_1\beta_2|x_k - x^*|^2 < |x_k - x^*|. \end{aligned}$$

The final inequality shows that the new point is closer to $x^*$ than the old point, and
hence all conditions apply again to $x_{k+1}$. The previous inequality establishes that
convergence is second order. ∎
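Order-two convergence is easy to observe numerically. As a sketch, the pure iteration (8.62) is applied below to the two-variable function of Application 3, $f(x, y) = x^2 - 5xy + y^4 - 25x - 8y$, whose minimizer is $(20, 3)$. The starting point $(19, 2.8)$ and the function names are chosen here for illustration, and the $2 \times 2$ linear system is solved by Cramer’s rule.

```python
import math


def gradient(x, y, a=25.0, b=8.0):
    return 2.0 * x - 5.0 * y - a, -5.0 * x + 4.0 * y ** 3 - b


def hessian(x, y):
    return 2.0, -5.0, 12.0 * y * y      # entries F11, F12 = F21, F22


def newton_step(x, y):
    """One pure Newton step: solve F d = -g (2 x 2 case) and return x + d."""
    g1, g2 = gradient(x, y)
    f11, f12, f22 = hessian(x, y)
    det = f11 * f22 - f12 * f12
    d1 = (-g1 * f22 + g2 * f12) / det
    d2 = (-g2 * f11 + g1 * f12) / det
    return x + d1, y + d2


point = (19.0, 2.8)                     # assumed starting point near (20, 3)
errors = [math.hypot(point[0] - 20.0, point[1] - 3.0)]
for _ in range(6):
    point = newton_step(*point)
    errors.append(math.hypot(point[0] - 20.0, point[1] - 3.0))

for k, err in enumerate(errors):
    print(k, err)
```

The number of correct digits roughly doubles at each step until rounding error takes over.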


Modifications 

Although  Newton’s  method  is  very  attractive  in  terms  of  its  convergence  properties 
near  the  solution,  it  requires  modification  before  it  can  be  used  at  points  that  are 
remote  from  the  solution.  The  general  nature  of  these  modifications  is  discussed  in 
the  remainder  of  this  section. 

1. Damping. The first modification is that usually a search parameter $\alpha$ is introduced
so that the method takes the form

$$x_{k+1} = x_k - \alpha_k[F(x_k)]^{-1}\nabla f(x_k)^T,$$

where $\alpha_k$ is selected to minimize $f$. Near the solution we expect, on the basis of
how Newton’s method was derived, that $\alpha_k \simeq 1$. Introducing the parameter for
general points, however, guards against the possibility that the objective might
increase with $\alpha_k = 1$, due to nonquadratic terms in the objective function.

2. Positive Definiteness. A basic consideration for Newton’s method can be seen
most clearly by a brief examination of the general class of algorithms

$$x_{k+1} = x_k - \alpha M_k g_k, \tag{8.63}$$

where $M_k$ is an $n \times n$ matrix, $\alpha$ is a positive search parameter, and $g_k = \nabla f(x_k)^T$.
We note that both steepest descent ($M_k = I$) and Newton’s method ($M_k =
[F(x_k)]^{-1}$) belong to this class. The direction vector $d_k = -M_k g_k$ obtained in
this way is a direction of descent if for small $\alpha$ the value of $f$ decreases as $\alpha$
increases from zero. For small $\alpha$ we can say

$$f(x_{k+1}) = f(x_k) + \nabla f(x_k)(x_{k+1} - x_k) + O(|x_{k+1} - x_k|^2).$$

Employing (8.63) this can be written as

$$f(x_{k+1}) = f(x_k) - \alpha g_k^T M_k g_k + O(\alpha^2).$$

As $\alpha \to 0$, the second term on the right dominates the third. Hence if one is to
guarantee a decrease in $f$ for small $\alpha$, we must have $g_k^T M_k g_k > 0$. The simplest
way to insure this is to require that $M_k$ be positive definite.

The best circumstance is that where $F(x)$ is itself positive definite throughout
the search region. The objective functions of many important optimization problems
have this property, including for example interior-point approaches to linear programming
using the logarithm as a barrier function. Indeed, it can be argued that
convexity is an inherent property of the majority of well-formulated optimization
problems.

Therefore, assume that the Hessian matrix $F(x)$ is positive definite throughout
the search region and that $f$ has continuous third derivatives. At a given $x_k$ define
the symmetric matrix $T = F(x_k)^{-1/2}$. As in Sect. 8.3 introduce the change of variable
$Ty = x$. Then according to (8.53) a steepest descent direction with respect to $y$ is
equivalent to a direction with respect to $x$ of $d = -TT^T g(x_k)$, where $g(x_k)$ is the
gradient of $f$ with respect to $x$ at $x_k$. Thus, $d = -F^{-1}g(x_k)$. In other words, a steepest
descent direction in $y$ is equivalent to a Newton direction in $x$.

We can turn this relation around to analyze Newton steps in $x$ as equivalent to
gradient steps in $y$. We know that convergence properties in $y$ depend on the bounds
on the Hessian matrix given by (8.54) as

$$H(y) = T^T F(x)\, T = F^{-1/2} F(x) F^{-1/2}. \tag{8.64}$$

Recall that $F = F(x_k)$ which is fixed, whereas $F(x)$ denotes the general Hessian
matrix with respect to $x$ near $x_k$. The product (8.64) is the identity matrix at $y_k$
but the rate of convergence of steepest descent in $y$ depends on the bounds of the
smallest and largest eigenvalues of $H(y)$ in a region near $y_k$.

These observations tell us that the damped version of Newton’s method will converge
at a linear rate at least as fast as $c = (1 - a/A)$ where $a$ and $A$ are lower
and upper bounds on the eigenvalues of $F(x_0)^{-1/2}F(x^0)F(x_0)^{-1/2}$, where $x_0$ and $x^0$
are arbitrary points in the local search region. These bounds depend, in turn, on
the bounds of the third-order derivatives of $f$. It is clear, however, by continuity of
$F(x)$ and its derivatives, that the rate becomes very fast near the solution, becoming
superlinear, and in fact, as we know, quadratic.

3. Backtracking. The backtracking method of line search, using $\alpha = 1$ as the
initial guess, is an attractive procedure for use with Newton’s method. Using
this method the overall progress of Newton’s method divides naturally into two
phases: first a damping phase where backtracking may require $\alpha < 1$, and second
a quadratic phase where $\alpha = 1$ satisfies the backtracking criterion at every
step. The damping phase was discussed above.

Let us now examine the situation when close to the solution. We assume that all
derivatives of $f$ through the third are continuous and uniformly bounded. We also
assume that in the region close to the solution, $F(x)$ is positive definite with $a > 0$
and $A > 0$ being, respectively, uniform lower and upper bounds on the eigenvalues
of $F(x)$. Using $\alpha = 1$ and $\varepsilon < 0.5$ we have for $d_k = -F(x_k)^{-1}g(x_k)$

$$\begin{aligned} f(x_k + d_k) &= f(x_k) - g(x_k)^T F(x_k)^{-1} g(x_k) + \tfrac{1}{2}g(x_k)^T F(x_k)^{-1} g(x_k) + o(|g(x_k)|^2) \\ &= f(x_k) - \tfrac{1}{2}g(x_k)^T F(x_k)^{-1} g(x_k) + o(|g(x_k)|^2) \\ &< f(x_k) - \varepsilon g(x_k)^T F(x_k)^{-1} g(x_k) + o(|g(x_k)|^2), \end{aligned}$$

where the $o$ bound is uniform for all $x_k$. Since $|g(x_k)| \to 0$ (uniformly) as $x_k \to x^*$, it
follows that once $x_k$ is sufficiently close to $x^*$, then $f(x_k + d_k) < f(x_k) + \varepsilon g(x_k)^T d_k$ and
hence the backtracking test (the first part of Armijo’s rule) is satisfied. This means
that $\alpha = 1$ will be used throughout the final phase.

4. General Problems. In practice, Newton’s method must be modified to accommodate
the possible nonpositive definiteness at regions remote from the
solution.

A common approach is to take $M_k = [\varepsilon_k I + F(x_k)]^{-1}$ for some non-negative value
of $\varepsilon_k$. This can be regarded as a kind of compromise between steepest descent ($\varepsilon_k$
very large) and Newton’s method ($\varepsilon_k = 0$). There is always an $\varepsilon_k$ that makes $M_k$
positive definite. We shall present one modification of this type.

Let $F_k \equiv F(x_k)$. Fix a constant $\delta > 0$. Given $x_k$, calculate the eigenvalues of $F_k$
and let $\varepsilon_k$ be the smallest nonnegative constant for which the matrix $\varepsilon_k I + F_k$ has
eigenvalues greater than or equal to $\delta$. Then define

$$d_k = -(\varepsilon_k I + F_k)^{-1} g_k \tag{8.65}$$

and iterate according to

$$x_{k+1} = x_k + \alpha_k d_k, \tag{8.66}$$

where $\alpha_k$ minimizes $f(x_k + \alpha d_k)$, $\alpha \ge 0$.

This algorithm has the desired global and local properties. First, since the eigenvalues
of a matrix depend continuously on its elements, $\varepsilon_k$ is a continuous function
of $x_k$ and hence the mapping $D : E^n \to E^{2n}$ defined by $D(x_k) = (x_k, d_k)$ is continuous.
Thus the algorithm $A = SD$ is closed at points outside the solution set
$\Omega = \{x : \nabla f(x) = 0\}$. Second, since $\varepsilon_k I + F_k$ is positive definite, $d_k$ is a descent direction
and thus $Z(x) \equiv f(x)$ is a continuous descent function for $A$. Therefore, assuming
the generated sequence is bounded, the Global Convergence Theorem applies.
Furthermore, if $\delta > 0$ is smaller than the smallest eigenvalue of $F(x^*)$, then for $x_k$
sufficiently close to $x^*$ we will have $\varepsilon_k = 0$, and the method reduces to Newton’s
method. Thus this revised method also has order of convergence equal to two.

The selection of an appropriate $\delta$ is somewhat of an art. A small $\delta$ means that
nearly singular matrices must be inverted, while a large $\delta$ means that the order
two convergence may be lost. Experimentation and familiarity with a given class
of problems are often required to find the best $\delta$.

The utility of the above algorithm is hampered by the necessity to calculate the
eigenvalues of $F(x_k)$, and in practice an alternate procedure is used. In one class
of methods (Levenberg-Marquardt type methods), for a given value of $\varepsilon_k$, Cholesky
factorization of the form $\varepsilon_k I + F(x_k) = GG^T$ (see Exercise 6 of Chap. 7) is employed
to check for positive definiteness. If the factorization breaks down, $\varepsilon_k$ is increased.
The factorization then also provides the direction vector through solution of the
equations $GG^T d_k = -g_k$, which are easily solved, since $G$ is triangular. Then the
value $f(x_k + d_k)$ is examined. If it is sufficiently below $f(x_k)$, then $x_{k+1}$ is accepted
and a new $\varepsilon_{k+1}$ is determined. Essentially, $\varepsilon$ serves as a search parameter in these
methods. It should be clear from this discussion that the simplicity that Newton’s
method first seemed to promise is not fully realized in practice.
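The Levenberg-Marquardt type procedure just described can be sketched in a few dozen lines. The version below is one possible reading, not a definitive implementation. The rule for increasing $\varepsilon_k$ (doubling, starting from $10^{-3}$), the acceptance test with the constant `sigma`, the stopping tolerance, and the nonconvex test function $f(x, y) = x^4/4 - x^2/2 + y^2/2$ are all assumptions made here for illustration.

```python
import math


def cholesky(M):
    """Lower-triangular G with G G^T = M, or None if M is not positive definite."""
    n = len(M)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] - sum(G[i][k] * G[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return None            # factorization breaks down
                G[i][i] = math.sqrt(s)
            else:
                G[i][j] = s / G[j][j]
    return G


def solve_cholesky(G, rhs):
    """Solve G G^T d = rhs by forward and back substitution."""
    n = len(G)
    y = [0.0] * n
    for i in range(n):
        y[i] = (rhs[i] - sum(G[i][k] * y[k] for k in range(i))) / G[i][i]
    d = [0.0] * n
    for i in reversed(range(n)):
        d[i] = (y[i] - sum(G[k][i] * d[k] for k in range(i + 1, n))) / G[i][i]
    return d


def modified_newton(f, grad, hess, x, tol=1e-6, max_iter=100, sigma=1e-4):
    n = len(x)
    for _ in range(max_iter):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            break
        F = hess(x)
        eps = 0.0                          # try the pure Newton step first
        for _ in range(60):
            shifted = [[F[i][j] + (eps if i == j else 0.0) for j in range(n)]
                       for i in range(n)]
            G = cholesky(shifted)
            if G is not None:
                d = solve_cholesky(G, [-gi for gi in g])
                trial = [xi + di for xi, di in zip(x, d)]
                slope = sum(gi * di for gi, di in zip(g, d))
                if f(trial) <= f(x) + sigma * slope:   # sufficient decrease
                    x = trial
                    break
            eps = max(2.0 * eps, 1e-3)     # assumed rule for increasing eps
    return x


# Hypothetical nonconvex test function; its Hessian is indefinite near x = 0.
f = lambda v: v[0] ** 4 / 4.0 - v[0] ** 2 / 2.0 + v[1] ** 2 / 2.0
grad = lambda v: [v[0] ** 3 - v[0], v[1]]
hess = lambda v: [[3.0 * v[0] ** 2 - 1.0, 0.0], [0.0, 1.0]]

x_min = modified_newton(f, grad, hess, [0.1, 1.0])
print(x_min, f(x_min))
```

Starting where the Hessian is indefinite, the shift keeps every accepted step a descent step; once the iterates reach the region where the Hessian is positive definite, $\varepsilon_k = 0$ is accepted and the iteration becomes pure Newton.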


Newton ’s  Method  and  Logarithms 

Interior  point  methods  of  linear  and  nonlinear  programming  use  barrier  functions, 
which  usually  are  based  on  the  logarithm.  For  linear  programming  especially,  this 
means  that  the  only  nonlinear  terms  are  logarithms.  Newton’s  method  enjoys  some 
special  properties  in  this  case, 

To  illustrate,  let  us  apply  Newton’s  method  to  the  one-dimensional  problem 

$$\min\ [\,tx - \ln x\,] \qquad (8.67)$$

where $t$ is a positive parameter. The derivative at $x$ is

$$f'(x) = t - \frac{1}{x},$$

and of course the solution is $x^* = 1/t$, or equivalently $1 - tx^* = 0$. The second derivative is $f''(x) = 1/x^2$. Denoting by $x_+$ the result of one step of a pure Newton’s method (with step length equal to 1) applied to the point $x$, we find

$$x_+ = x - [f''(x)]^{-1} f'(x) = x - x^2\left(t - \frac{1}{x}\right) = x - tx^2 + x = 2x - tx^2.$$


Thus 

$$1 - tx_+ = 1 - 2tx + t^2x^2 = (1 - tx)^2. \qquad (8.68)$$

Therefore, rather surprisingly, the quadratic nature of convergence of $(1 - tx) \to 0$ is directly evident and exact. Expression (8.68) represents a reduction in the error magnitude only if $|1 - tx| < 1$, or equivalently, $0 < x < 2/t$. If $x$ is too large, then Newton’s method must be used with damping until the region $0 < x < 2/t$ is reached. From then on, a step size of 1 will exhibit pure quadratic error reduction.
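The exact squaring of the error in (8.68), and the need for damping when $x \ge 2/t$, can be observed numerically. The sketch below is illustrative only; the function name `newton_log` and the particular damping rule (halving the step until the trial point is positive) are assumptions made here, since the text does not prescribe a specific damping scheme.

```python
def newton_log(t, x0, max_iter=60, tol=1e-15):
    """Newton iterates for f(x) = t*x - ln(x), starting from x0 > 0."""
    x = float(x0)
    xs = [x]
    for _ in range(max_iter):
        # Newton direction: -f'(x)/f''(x) = -(t - 1/x) * x**2 = -(t*x**2 - x)
        newton_dir = -(t * x * x - x)
        step = 1.0
        if x >= 2.0 / t:
            # Outside 0 < x < 2/t a full step would give a nonpositive point,
            # so halve the step until the trial point is positive.
            while x + step * newton_dir <= 0.0:
                step *= 0.5
        x = x + step * newton_dir
        xs.append(x)
        if abs(1.0 - t * x) < tol:
            break
    return xs

# Inside the region 0 < x < 2/t each error 1 - t*x is the square of the last.
for x in newton_log(2.0, 0.1):
    print(x, 1.0 - 2.0 * x)
```

Starting instead from a point such as `x0 = 100.0` with `t = 1.0` exercises the damping branch for the first few iterations before full steps take over.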

The situation is shown in Fig. 8.11. The graph is that of $f'(x) = t - 1/x$. The root-finding form of Newton’s method (Sect. 8.1) is then applied to this function. At each point, the tangent line is followed to the $x$ axis to find the new point. The starting value marked $x_1$ is far from the solution $1/t$ and hence following the tangent would lead to a new point that was negative. Damping must be applied at that starting point. Once a point $x$ is reached with $0 < x < 1/t$, all further points will remain to the left of $1/t$ and move toward it quadratically.


Fig. 8.11 Newton’s method applied to minimization of $tx - \ln x$


In  interior  point  methods  for  linear  programming,  a  logarithmic  barrier  function 
is  applied  separately  to  the  variables  that  must  remain  positive.  The  convergence 
analysis  in  these  situations  is  an  extension  of  that  for  the  simple  case  given  here, 
allowing  for  estimates  of  the  rate  of  convergence  that  do  not  require  knowledge  of 
bounds of third-order derivatives.

Self-concordant Functions

The special properties exhibited above for the logarithm have been extended to the general class of self-concordant functions of which the logarithm is the primary example. A function $f$ defined on the real line is self-concordant if it satisfies

$$|f'''(x)| \le 2 f''(x)^{3/2}, \qquad (8.69)$$

throughout its domain. It is easily verified that $f(x) = -\ln x$ satisfies this inequality with equality for $x > 0$.
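The verification for the logarithm is a one-line computation, written out here for convenience (valid for $x > 0$):

```latex
f(x) = -\ln x, \qquad
f'(x) = -\frac{1}{x}, \qquad
f''(x) = \frac{1}{x^{2}}, \qquad
f'''(x) = -\frac{2}{x^{3}},
\qquad\text{so}\qquad
|f'''(x)| = \frac{2}{x^{3}} = 2\left(\frac{1}{x^{2}}\right)^{3/2} = 2 f''(x)^{3/2}.
```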

Self-concordancy  is  preserved  by  the  addition  of  an  affine  term  since  such  a  term 
does  not  affect  the  second  or  third  derivatives. 

A function defined on $E^n$ is said to be self-concordant if it is self-concordant in every direction: that is, if $f(\mathbf{x} + \alpha\mathbf{d})$ is self-concordant with respect to $\alpha$ for every $\mathbf{d}$ throughout the domain of $f$.

Self-concordant functions can be combined by addition and even by composition with affine functions to yield other self-concordant functions. (See Exercise 29.) For example the function

$$f(\mathbf{x}) = -\sum_{i=1}^{m} \ln(\mathbf{a}_i^T\mathbf{x} - b_i),$$

often used in interior point methods for linear programming, is self-concordant.

When a self-concordant function is subjected to Newton’s method, the quadratic convergence of the final phase can be measured in terms of the function

$$\lambda(\mathbf{x}) = [\nabla f(\mathbf{x})\mathbf{F}(\mathbf{x})^{-1}\nabla f(\mathbf{x})^T]^{1/2},$$

where as usual $\mathbf{F}(\mathbf{x})$ is the Hessian matrix of $f$ at $\mathbf{x}$. Then it can be shown that close to the solution

$$2\lambda(\mathbf{x}_{k+1}) \le [2\lambda(\mathbf{x}_k)]^2. \qquad (8.70)$$

Furthermore, in a backtracking procedure, estimates of both the stepwise progress in the damping phase and the point at which the quadratic phase begins can be expressed in terms of parameters that depend only on the backtracking parameters. Although this knowledge does not generally influence practice, it is theoretically quite interesting.
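The quantity $\lambda(\mathbf{x})$ is easy to monitor in practice. The sketch below applies Newton’s method to a small log-barrier function and records $\lambda$ at each iterate. Everything specific in it is an assumption made for illustration: the data `A`, `b`, `c`, the threshold 0.25 at which full steps begin, and the damped step length $1/(1+\lambda)$ used before that (a standard choice for self-concordant functions, but not one taken from the text).

```python
import numpy as np

# Hypothetical test function f(x) = c^T x - sum_i ln(a_i^T x - b_i), whose
# slacks are x1, x2, 1 - x1, 1 - x2 and 1.5 - x1 - x2.
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [-1.0, -1.0]])
b = np.array([0.0, 0.0, -1.0, -1.0, -1.5])
c = np.array([1.0, -1.0])

def gradient_and_hessian(x):
    s = A @ x - b                          # slacks, positive inside the domain
    grad = c - A.T @ (1.0 / s)
    hess = A.T @ (A / s[:, None] ** 2)     # sum_i a_i a_i^T / s_i^2
    return grad, hess

def newton_decrement(x):
    grad, hess = gradient_and_hessian(x)
    return float(np.sqrt(max(grad @ np.linalg.solve(hess, grad), 0.0)))

def newton_with_decrement(x0, iters=30):
    x = np.array(x0, dtype=float)
    lams = [newton_decrement(x)]
    for _ in range(iters):
        grad, hess = gradient_and_hessian(x)
        lam = lams[-1]
        # Full step once lambda is small, damped step 1/(1 + lambda) before.
        step = 1.0 if lam <= 0.25 else 1.0 / (1.0 + lam)
        x = x - step * np.linalg.solve(hess, grad)
        lams.append(newton_decrement(x))
    return x, lams

x_final, lams = newton_with_decrement([0.5, 0.5])
print(lams[:8])
```

Once $\lambda$ falls below the threshold, successive values can be compared directly with the bound (8.70).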

Example 1 (The Logarithmic Case). Consider the earlier example of $f(x) = tx - \ln x$. There

$$\lambda(x) = [f'(x)^2/f''(x)]^{1/2} = |(t - 1/x)x| = |1 - tx|.$$

Then (8.70) gives

$$(1 - tx_+) \le 2(1 - tx)^2.$$

Actually, for this example, as we found in (8.68), the factor of 2 is not required.

There is a relation between the analysis of self-concordant functions and our earlier convergence analysis.

Recall that one way to analyze Newton’s method is to change variables from $\mathbf{x}$ to $\mathbf{y}$ according to $\mathbf{y} = [\mathbf{F}(\bar{\mathbf{x}})]^{1/2}\mathbf{x}$, where here $\bar{\mathbf{x}}$ is a reference point and $\mathbf{x}$ is variable. The gradient with respect to $\mathbf{y}$ at the reference point is then $[\mathbf{F}(\bar{\mathbf{x}})]^{-1/2}\nabla f(\bar{\mathbf{x}})^T$, and hence the norm of the gradient there is $[\nabla f(\bar{\mathbf{x}})\mathbf{F}(\bar{\mathbf{x}})^{-1}\nabla f(\bar{\mathbf{x}})^T]^{1/2} = \lambda(\bar{\mathbf{x}})$. Hence it is perhaps not surprising that $\lambda(\mathbf{x})$ plays a role analogous to the role played by the norm of the gradient in the analysis of steepest descent.


8.6  Coordinate  Descent  Methods 

The  algorithms  discussed  in  this  section  are  sometimes  attractive  because  of  their 
easy  implementation.  Generally,  however,  their  convergence  properties  are  poorer 
than  steepest  descent. 

Let $f$ be a function on $E^n$ having continuous first partial derivatives. Given a point $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, descent with respect to the coordinate $x_i$ ($i$ fixed) means that one solves

$$\underset{x_i}{\text{minimize}}\ f(x_1, x_2, \ldots, x_n).$$

Thus only changes in the single component $x_i$ are allowed in seeking a new and better vector $\mathbf{x}$ (one can also take $x_i$ to be the $i$th block of variables, which gives the block-coordinate method). In our general terminology, each such descent can be regarded as a descent in the direction $\mathbf{e}_i$ (or $-\mathbf{e}_i$) where $\mathbf{e}_i$ is the $i$th unit vector. By sequentially minimizing with respect to different components, a relative minimum of $f$ might ultimately be determined.

There are a number of ways that this concept can be developed into a full algorithm. The cyclic coordinate descent algorithm minimizes $f$ cyclically with respect to the coordinate variables. Thus $x_1$ is changed first, then $x_2$ and so forth through $x_n$. The process is then repeated starting with $x_1$ again. A variation of this is the Aitken double sweep method. In this procedure one searches over $x_1, x_2, \ldots, x_n$, in that order, and then comes back in the order $x_{n-1}, x_{n-2}, \ldots, x_1$. These cyclic methods have the advantage of not requiring any information about $\nabla f$ to determine the descent directions.

If the gradient of $f$ is available, then it is possible to select the order of descent coordinates on the basis of the gradient. A popular technique is the Gauss-Southwell Method where at each stage the coordinate corresponding to the largest (in absolute value) component of the gradient vector is selected for descent. A randomized strategy can also be adopted, in which one randomly chooses a coordinate to optimize in each step; see the discussion later in this section.

Global Convergence

It is simple to prove global convergence for cyclic coordinate descent. The algorithmic map $\mathbf{A}$ is the composition of $2n$ maps

$$\mathbf{A} = \mathbf{S}\mathbf{C}^n\mathbf{S}\mathbf{C}^{n-1} \cdots \mathbf{S}\mathbf{C}^1,$$

where $\mathbf{C}^i(\mathbf{x}) = (\mathbf{x}, \mathbf{e}_i)$ with $\mathbf{e}_i$ equal to the $i$th unit vector, and $\mathbf{S}$ is the usual line search algorithm but over the doubly infinite line rather than the semi-infinite line. The map $\mathbf{C}^i$ is obviously continuous and $\mathbf{S}$ is closed. If we assume that points are restricted to a compact set, then $\mathbf{A}$ is closed by Corollary 1, Sect. 7.7. We define the solution set $\Gamma = \{\mathbf{x} : \nabla f(\mathbf{x}) = \mathbf{0}\}$. If we impose the mild assumption on $f$ that a search along any coordinate direction yields a unique minimum point, then the function $Z(\mathbf{x}) = f(\mathbf{x})$ serves as a continuous descent function for $\mathbf{A}$ with respect to $\Gamma$. This is because a search along any coordinate direction either must yield a decrease or, by the uniqueness assumption, it cannot change position. Therefore, if at a point $\mathbf{x}$ we have $\nabla f(\mathbf{x}) \ne \mathbf{0}$, then at least one component of $\nabla f(\mathbf{x})$ does not vanish and a search along the corresponding coordinate direction must yield a decrease.


Local  Convergence  Rate 


It  is  difficult  to  compare  the  rates  of  convergence  of  these  algorithms  with  the  rates 
of  others  that  we  analyze.  This  is  partly  because  coordinate  descent  algorithms  are 
from  an  entirely  different  general  class  of  algorithms  than,  for  example,  steepest 
descent  and  Newton’s  method,  since  coordinate  descent  algorithms  are  unaffected 
by  (diagonal)  scale  factor  changes  but  are  affected  by  rotation  of  coordinates — the 
opposite  being  true  for  steepest  descent.  Nevertheless,  some  comparison  is  possible. 

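It can be shown (see Exercise 20) that for the same quadratic problem as treated in Sect. 8.2, there holds for the Gauss-Southwell method

$$E(\mathbf{x}_{k+1}) \le \left(1 - \frac{a}{A(n-1)}\right) E(\mathbf{x}_k), \qquad (8.71)$$

where $a$, $A$ are as in Sect. 8.2 and $n$ is the dimension of the problem. Since

$$\left(\frac{A-a}{A+a}\right)^2 \le \left(1 - \frac{a}{A}\right) \le \left(1 - \frac{a}{A(n-1)}\right)^{n-1}, \qquad (8.72)$$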


we see that the bound we have for steepest descent is better than the bound we have for $n - 1$ applications of the Gauss-Southwell scheme. Hence we might argue that it takes essentially $n - 1$ coordinate searches to be as effective as a single gradient search. This is admittedly a crude guess, since (8.54) is generally not a tight bound, but the overall conclusion is consistent with the results of many experiments. Indeed, unless the variables of a problem are essentially uncoupled from each other (corresponding to a nearly diagonal Hessian matrix) coordinate descent methods seem to require about $n$ line searches to equal the effect of one step of steepest descent.
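The two inequalities in (8.72) follow from elementary facts. A short check, assuming as in Sect. 8.2 that $0 < a \le A$, and that $n \ge 2$:

```latex
% First inequality: A + a \ge A, and 0 \le 1 - a/A < 1.
\left(\frac{A-a}{A+a}\right)^{2}
  \le \left(\frac{A-a}{A}\right)^{2}
  = \left(1-\frac{a}{A}\right)^{2}
  \le 1-\frac{a}{A}.

% Second inequality: Bernoulli's inequality (1+y)^{m} \ge 1+my
% with y = -a/(A(n-1)) \ge -1 and m = n-1.
\left(1-\frac{a}{A(n-1)}\right)^{n-1}
  \ge 1-(n-1)\,\frac{a}{A(n-1)}
  = 1-\frac{a}{A}.
```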

The above discussion again illustrates the general objective that we seek in convergence analysis. By comparing the formula giving the rate of convergence for steepest descent with a bound for coordinate descent, we are able to draw some general conclusions on the relative performance of the two methods that are not dependent on specific values of $a$ and $A$. Our analyses of local convergence properties, which usually involve specific formulae, are always guided by this objective of obtaining general qualitative comparisons.

Example. The quadratic problem considered in Sect. 8.2 with

$$\mathbf{Q} = \begin{bmatrix} 0.78 & -0.02 & -0.12 & -0.14 \\ -0.02 & 0.86 & -0.04 & 0.06 \\ -0.12 & -0.04 & 0.72 & -0.08 \\ -0.14 & 0.06 & -0.08 & 0.74 \end{bmatrix}, \qquad \mathbf{b} = (0.76,\ 0.08,\ 1.12,\ 0.68)$$

was solved by the various coordinate search methods. The corresponding values of the objective function are shown in Table 8.3. Observe that the convergence rates of the three coordinate search methods are approximately equal but that they all converge about three times slower than steepest descent. This is in accord with the estimate given above for the Gauss-Southwell method, since in this case $n - 1 = 3$.
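The three coordinate search methods of this example can be sketched in a few lines. The code below is an illustration only: it assumes the objective $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} - \mathbf{b}^T\mathbf{x}$ of Sect. 8.2, a starting point at the origin, and function names chosen here; the interpretation of the double sweep order (forward over all coordinates, back down to the first, without searching the same coordinate twice in a row) is likewise an assumption.

```python
import numpy as np

Q = np.array([[0.78, -0.02, -0.12, -0.14],
              [-0.02, 0.86, -0.04, 0.06],
              [-0.12, -0.04, 0.72, -0.08],
              [-0.14, 0.06, -0.08, 0.74]])
b = np.array([0.76, 0.08, 1.12, 0.68])

def f(x):
    return 0.5 * x @ Q @ x - b @ x

def minimize_along_coordinate(x, i):
    """Exact line search along e_i for the quadratic f."""
    g_i = Q[i] @ x - b[i]              # i-th component of the gradient
    x_new = x.copy()
    x_new[i] -= g_i / Q[i, i]
    return x_new

def cyclic(k, x):
    return k % len(x)

def double_sweep(k, x):
    n = len(x)
    # Indices 0..n-1, then n-2..1; the index 0 that opens the next period
    # closes the backward sweep.
    sweep = list(range(n)) + list(range(n - 2, 0, -1))
    return sweep[k % len(sweep)]

def gauss_southwell(k, x):
    return int(np.argmax(np.abs(Q @ x - b)))   # largest gradient component

def run(rule, steps):
    x = np.zeros(len(b))
    values = [f(x)]
    for k in range(steps):
        x = minimize_along_coordinate(x, rule(k, x))
        values.append(f(x))
    return values

for name, rule in [("Gauss-Southwell", gauss_southwell),
                   ("Cyclic", cyclic), ("Double sweep", double_sweep)]:
    print(name, [round(v, 6) for v in run(rule, 4)])
```

The first Gauss-Southwell step gives $-1.12^2/(2 \cdot 0.72) \approx -0.871111$ and the first cyclic step gives $-0.76^2/(2 \cdot 0.78) \approx -0.370256$, in agreement with the first row of Table 8.3.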


Convergence  Speed  of  a  Randomized  Coordinate  Descent  Method 

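We now describe a randomized strategy for selecting the coordinate $x_i$ in each step of the coordinate descent method, for a function $f$ that is differentiable with Lipschitz continuous partial derivatives; that is, there exist constants $\beta_i > 0$, $i = 1, \ldots, n$, such that

$$|\nabla_i f(\mathbf{x} + h\mathbf{e}_i) - \nabla_i f(\mathbf{x})| \le \beta_i |h|, \quad \forall h \in E,\ \mathbf{x} \in E^n, \qquad (8.73)$$

where $\nabla_i f(\mathbf{x})$ denotes the $i$th partial derivative of $f$ at $\mathbf{x}$, and $\mathbf{e}_i$ is the $i$th unit vector, with $i$th entry equal to 1 and all other entries equal to 0.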

Randomized coordinate descent method. Given an initial point $\mathbf{x}_0$; repeat for $k = 0, 1, 2, \ldots$

1. Choose $i_k \in \{1, \ldots, n\}$ randomly with a uniform distribution.

2. Update $\mathbf{x}_{k+1} = \mathbf{x}_k - \frac{1}{\beta_{i_k}}\nabla_{i_k} f(\mathbf{x}_k)\,\mathbf{e}_{i_k}$.

Note that after $k$ iterations, the randomized coordinate descent method generates a random sequence of $\mathbf{x}_k$, which depends on the observed realization of the random variable

$$\xi_{k-1} = \{i_0, i_1, \ldots, i_{k-1}\}.$$

Table 8.3 Solutions to Example

                  Value of f for various methods
Iteration no.   Gauss-Southwell   Cyclic       Double sweep
 0               0.0               0.0          0.0
 1              -0.871111         -0.370256    -0.370256
 2              -1.445584         -0.376011    -0.376011
 3              -2.087054         -1.446460    -1.446460
 4              -2.130796         -2.052949    -2.052949
 5              -2.163586         -2.149690    -2.060234
 6              -2.170272         -2.149693    -2.060237
 7              -2.172786         -2.167983    -2.165641
 8              -2.174279         -2.173169    -2.165704
 9              -2.174583         -2.174392    -2.168440
10              -2.174638         -2.174397    -2.173981
11              -2.174651         -2.174582    -2.174048
12              -2.174655         -2.174643    -2.174054
13              -2.174658         -2.174656    -2.174608
14              -2.174659         -2.174656    -2.174608
15              -2.174659         -2.174658    -2.174622
16                                -2.174659    -2.174655
17                                -2.174659    -2.174656
18                                             -2.174656
19                                             -2.174659
20                                             -2.174659
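The two steps of the randomized coordinate descent method translate almost directly into code. The following is a minimal sketch, not a definitive implementation: the function name, the fixed random seed, and the optional history of function values are conveniences assumed here. It is applied to the quadratic of the Example, with $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} - \mathbf{b}^T\mathbf{x}$ assumed as in Sect. 8.2, for which $\nabla_i f(\mathbf{x}) = (\mathbf{Q}\mathbf{x} - \mathbf{b})_i$ and (8.73) holds with $\beta_i$ equal to the $i$th diagonal entry of $\mathbf{Q}$.

```python
import numpy as np

def randomized_coordinate_descent(grad_i, beta, x0, iters, f=None, seed=0):
    """Run `iters` steps of x <- x - (1/beta_i) * grad_i(x, i) * e_i."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    history = [] if f is None else [f(x)]
    for _ in range(iters):
        i = rng.integers(x.size)        # uniform over the coordinates 0..n-1
        x[i] -= grad_i(x, i) / beta[i]
        if f is not None:
            history.append(f(x))
    return x, history

# Quadratic of the Example.
Q = np.array([[0.78, -0.02, -0.12, -0.14],
              [-0.02, 0.86, -0.04, 0.06],
              [-0.12, -0.04, 0.72, -0.08],
              [-0.14, 0.06, -0.08, 0.74]])
b = np.array([0.76, 0.08, 1.12, 0.68])

def f(x):
    return 0.5 * x @ Q @ x - b @ x

def grad_i(x, i):
    return Q[i] @ x - b[i]

x, history = randomized_coordinate_descent(grad_i, np.diag(Q), np.zeros(4),
                                           200, f=f)
print(history[-1])
```

For a quadratic, the step $1/\beta_i$ with $\beta_i$ equal to the diagonal entry coincides with an exact line search along the chosen coordinate, so the recorded function values never increase.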

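Theorem 3 (Randomized Coordinate Descent: Lipschitz Convex Case). Let $f(\mathbf{x})$ be convex and differentiable everywhere, satisfy the Lipschitz condition (8.73), and admit a minimizer $\mathbf{x}^*$. Then the randomized coordinate descent method generates a sequence of solutions $\mathbf{x}_k$ such that for any $k \ge 1$, the iterate $\mathbf{x}_k$ satisfies

$$E_{\xi_{k-1}}[f(\mathbf{x}_k)] - f(\mathbf{x}^*) \le \frac{n}{n+k}\left(\frac{1}{2}\|\mathbf{x}_0 - \mathbf{x}^*\|_\beta^2 + f(\mathbf{x}_0) - f(\mathbf{x}^*)\right),$$

where $\|\mathbf{x}\|_\beta = \left(\sum_{i=1}^n \beta_i x_i^2\right)^{1/2}$ for all $\mathbf{x} \in E^n$.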

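Proof. Let $r_k^2 = \|\mathbf{x}_k - \mathbf{x}^*\|_\beta^2 = \sum_{i=1}^n \beta_i((\mathbf{x}_k)_i - x_i^*)^2$ for any $k \ge 0$. Since $\mathbf{x}_{k+1} = \mathbf{x}_k - \frac{1}{\beta_{i_k}}\nabla_{i_k} f(\mathbf{x}_k)\mathbf{e}_{i_k}$, we have

$$r_{k+1}^2 = r_k^2 - 2\nabla_{i_k} f(\mathbf{x}_k)((\mathbf{x}_k)_{i_k} - x_{i_k}^*) + \frac{1}{\beta_{i_k}}(\nabla_{i_k} f(\mathbf{x}_k))^2.$$

It follows from (8.73), Lemma 1, and $\mathbf{x}_{k+1} = \mathbf{x}_k - \frac{1}{\beta_{i_k}}\nabla_{i_k} f(\mathbf{x}_k)\mathbf{e}_{i_k}$ that

$$f(\mathbf{x}_{k+1}) \le f(\mathbf{x}_k) + \nabla_{i_k} f(\mathbf{x}_k)((\mathbf{x}_{k+1})_{i_k} - (\mathbf{x}_k)_{i_k}) + \frac{\beta_{i_k}}{2}((\mathbf{x}_{k+1})_{i_k} - (\mathbf{x}_k)_{i_k})^2 = f(\mathbf{x}_k) - \frac{1}{2\beta_{i_k}}(\nabla_{i_k} f(\mathbf{x}_k))^2. \qquad (8.74)$$

Combining the above two relations, one has

$$r_{k+1}^2 \le r_k^2 - 2\nabla_{i_k} f(\mathbf{x}_k)((\mathbf{x}_k)_{i_k} - x_{i_k}^*) + 2(f(\mathbf{x}_k) - f(\mathbf{x}_{k+1})).$$

Multiplying both sides by 1/2 and taking expectation with respect to $i_k$ yields

$$E_{i_k}\left[\frac{1}{2}r_{k+1}^2\right] \le \frac{1}{2}r_k^2 - \frac{1}{n}\nabla f(\mathbf{x}_k)(\mathbf{x}_k - \mathbf{x}^*) + f(\mathbf{x}_k) - E_{i_k}[f(\mathbf{x}_{k+1})],$$

which together with the fact that $\nabla f(\mathbf{x}_k)(\mathbf{x}^* - \mathbf{x}_k) \le f(\mathbf{x}^*) - f(\mathbf{x}_k)$ yields

$$E_{i_k}\left[\frac{1}{2}r_{k+1}^2\right] \le \frac{1}{2}r_k^2 + \frac{1}{n}f(\mathbf{x}^*) + \frac{n-1}{n}f(\mathbf{x}_k) - E_{i_k}[f(\mathbf{x}_{k+1})].$$

By rearranging terms, we obtain that for each $k \ge 0$,

$$E_{i_k}\left[\frac{1}{2}r_{k+1}^2 + f(\mathbf{x}_{k+1}) - f(\mathbf{x}^*)\right] \le \left(\frac{1}{2}r_k^2 + f(\mathbf{x}_k) - f(\mathbf{x}^*)\right) - \frac{1}{n}(f(\mathbf{x}_k) - f(\mathbf{x}^*)).$$

Let $f^* = f(\mathbf{x}^*)$. Then, taking expectation with respect to $\xi_{k-1}$ on both sides of the above relation, we have

$$E_{\xi_k}\left[\frac{1}{2}r_{k+1}^2 + f(\mathbf{x}_{k+1}) - f^*\right] \le E_{\xi_{k-1}}\left[\frac{1}{2}r_k^2 + f(\mathbf{x}_k) - f^*\right] - \frac{1}{n}E_{\xi_{k-1}}[f(\mathbf{x}_k) - f^*]. \qquad (8.75)$$

In addition, it follows from (8.74) that $E_{\xi_j}[f(\mathbf{x}_{j+1})] \le E_{\xi_{j-1}}[f(\mathbf{x}_j)]$ for all $j \ge 0$. Using this relation and applying the inequality (8.75) recursively, we further obtain that

$$E_{\xi_k}[f(\mathbf{x}_{k+1})] - f^* \le E_{\xi_k}\left[\frac{1}{2}r_{k+1}^2 + f(\mathbf{x}_{k+1}) - f^*\right]$$

$$\le \frac{1}{2}r_0^2 + f(\mathbf{x}_0) - f^* - \frac{1}{n}\sum_{j=0}^{k}\left(E_{\xi_{j-1}}[f(\mathbf{x}_j)] - f^*\right)$$

$$\le \frac{1}{2}r_0^2 + f(\mathbf{x}_0) - f^* - \frac{k+1}{n}\left(E_{\xi_k}[f(\mathbf{x}_{k+1})] - f^*\right).$$

This leads to the desired result by moving the last term on the right to the left side.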


If $f$ is a strongly convex quadratic function, the randomized coordinate descent method would have an expected average convergence rate $\left(1 - \frac{a}{nA}\right)$, where $a$ and $A$ are the smallest and largest eigenvalues of the Hessian. However, each step of the method does only $\frac{1}{n}$ of the work of the full steepest descent update; see the exercises.

8.7 Summary

Most  iterative  algorithms  for  minimization  require  a  line  search  at  every  stage  of 
the  process.  By  employing  any  one  of  a  variety  of  curve  fitting  techniques,  however, 
the  order  of  convergence  of  the  line  search  process  can  be  made  greater  than  unity, 
which  means  that  as  compared  to  the  linear  convergence  that  accompanies  most  full 
descent  algorithms  (such  as  steepest  descent)  the  individual  line  searches  are  rapid. 
Indeed,  in  common  practice,  only  about  three  search  points  are  required  in  any  one 
line  search.  If  the  first  derivatives  are  available,  then  two  search  points  are  required 
(method  of  false  position);  and  if  both  first  and  second  derivatives  are  available,  then 
one  search  point  is  required  (Newton’s  method). 

It  was  also  shown  in  Sect.  8.1  and  the  exercises  that  line  search  algorithms  of 
varying  degrees  of  accuracy  are  all  closed.  Thus  line  searching  is  not  only  rapid 
enough  to  be  practical  but  also  behaves  in  such  a  way  as  to  make  analysis  of  global 
convergence  simple. 

The most important results of this chapter are the arithmetic convergence of the method of steepest descent for solving convex minimization, the improved arithmetic convergence of the accelerated steepest descent method, and the geometric convergence of the method for solving strongly convex minimization. In particular, the method of steepest descent converges linearly with a convergence ratio equal to $[(A - a)/(A + a)]^2$, where $a$ and $A$ are, respectively, the smallest and largest eigenvalues of the Hessian of the objective function evaluated at the solution point. This formula, which arises frequently throughout the remainder of the book, serves as a fundamental reference point for other algorithms. It is, however, important to understand that it is the formula and not its value that serves as the reference. We rarely advocate that the formula be evaluated since it involves quantities (namely eigenvalues) that are generally not computable until after the optimal solution is known. The formula itself, however, even though its value is unknown, can be used to make significant comparisons of the effectiveness of steepest descent versus other algorithms.

Newton’s method has order two convergence. However, it must be modified to ensure global convergence, and evaluation of the Hessian at every point can be costly. Nevertheless, Newton’s method provides another valuable reference point in the study of algorithms, and is frequently employed in interior point methods using a logarithmic barrier function.

As optimization problems grow in size, coordinate descent algorithms have become extremely popular. They are valuable especially in situations where the variables are essentially uncoupled or there is special structure that makes searching in the coordinate directions particularly easy. Typically, steepest descent can be expected to be faster. Even if the gradient is not directly available, it would probably be better to evaluate a finite-difference approximation to the gradient, by taking a single step in each coordinate direction, and use this approximation in a steepest descent algorithm, rather than executing a full line search in each coordinate direction.

8.8 Exercises

1. Show that $g[a, b, c]$ defined by (8.14) is symmetric, that is, interchange of the arguments does not affect its value.

2. Prove (8.14) and (8.15).

Hint: To prove (8.15) expand it, and subtract and add $g'(x_k)$ to the numerator.

3. Argue using symmetry that the error in the cubic fit method approximately satisfies an equation of the form

$$\varepsilon_{k+1} = M(\varepsilon_k^2\varepsilon_{k-1} + \varepsilon_k\varepsilon_{k-1}^2)$$

and then find the order of convergence.

4. What conditions on the values and derivatives at two points guarantee that a cubic polynomial fit to this data will have a minimum between the two points? Use your answer to develop a search scheme, based on cubic fit, that is globally convergent for unimodal functions.

5. Using a symmetry argument, find the order of convergence for a line search method that fits a cubic to $x_{k-3}, x_{k-2}, x_{k-1}, x_k$ in order to find $x_{k+1}$.

6. Consider the iterative process

$$x_{k+1} = \frac{1}{2}\left(x_k + \frac{a}{x_k}\right),$$

where $a > 0$. Assuming the process converges, to what does it converge? What is the order of convergence?

7. Suppose the continuous real-valued function $f$ of a single variable satisfies

$$\min_{x \ge 0} f(x) < f(0).$$

Starting at any $x > 0$ show that, through a series of halvings and doublings of $x$ and evaluation of the corresponding $f(x)$’s, a three-point pattern can be determined.

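8. For $\delta > 0$ define the map $\mathbf{S}^\delta$ by

$$\mathbf{S}^\delta(\mathbf{x}, \mathbf{d}) = \{\mathbf{y} : \mathbf{y} = \mathbf{x} + \alpha\mathbf{d},\ 0 \le \alpha \le \delta;\ f(\mathbf{y}) = \min_{0 \le \beta \le \delta} f(\mathbf{x} + \beta\mathbf{d})\}.$$

Thus $\mathbf{S}^\delta$ searches the interval $[0, \delta]$ for a minimum of $f(\mathbf{x} + \alpha\mathbf{d})$, representing a “limited range” line search. Show that if $f$ is continuous, $\mathbf{S}^\delta$ is closed at all $(\mathbf{x}, \mathbf{d})$.

9. For $\varepsilon > 0$ define the map ${}^{\varepsilon}\mathbf{S}$ by

$${}^{\varepsilon}\mathbf{S}(\mathbf{x}, \mathbf{d}) = \{\mathbf{y} : \mathbf{y} = \mathbf{x} + \alpha\mathbf{d},\ \alpha \ge 0,\ f(\mathbf{y}) \le \min_{0 \le \beta < \infty} f(\mathbf{x} + \beta\mathbf{d}) + \varepsilon\}.$$

Show that if $f$ is continuous, ${}^{\varepsilon}\mathbf{S}$ is closed at $(\mathbf{x}, \mathbf{d})$ if $\mathbf{d} \ne \mathbf{0}$. This map corresponds to an “inaccurate” line search.

10. Referring to the previous two exercises, define and prove a result for ${}^{\varepsilon}\mathbf{S}^{\delta}$.

11. Define $\mathbf{S}$ as the line search algorithm that finds the first relative minimum of $f(\mathbf{x} + \alpha\mathbf{d})$ for $\alpha \ge 0$. If $f$ is continuous and $\mathbf{d} \ne \mathbf{0}$, is $\mathbf{S}$ closed?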

12. Consider the problem

$$\text{minimize}\quad 5x^2 + 5y^2 - xy - 11x + 11y + 11.$$

(a) Find a point satisfying the first-order necessary conditions for a solution.

(b) Show that this point is a global minimum.

(c) What would be the rate of convergence of steepest descent for this problem?

(d) Starting at $x = y = 0$, how many steepest descent iterations would it take (at most) to reduce the function value to $10^{-11}$?

13. Define the search mapping $\mathbf{F}$ that determines the parameter $\alpha$ to within a given fraction $c$, $0 \le c \le 1$, by

$$\mathbf{F}(\mathbf{x}, \mathbf{d}) = \left\{\mathbf{y} : \mathbf{y} = \mathbf{x} + \alpha\mathbf{d},\ 0 \le \alpha < \infty,\ |\alpha - \bar{\alpha}| \le c\bar{\alpha},\ \text{where } \frac{d}{d\alpha}f(\mathbf{x} + \bar{\alpha}\mathbf{d}) = 0\right\}.$$

Show that if $\mathbf{d} \ne \mathbf{0}$ and $(d/d\alpha)f(\mathbf{x} + \alpha\mathbf{d})$ is continuous, then $\mathbf{F}$ is closed at $(\mathbf{x}, \mathbf{d})$.

14. Let $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_n$ denote the eigenvectors of the symmetric positive definite $n \times n$ matrix $\mathbf{Q}$. For the quadratic problem considered in Sect. 8.2, suppose $\mathbf{x}_0$ is chosen so that $\mathbf{g}_0$ belongs to a subspace $M$ spanned by a subset of the $\mathbf{e}_i$’s. Show that for the method of steepest descent $\mathbf{g}_k \in M$ for all $k$. Find the rate of convergence in this case.

15. Suppose we use the method of steepest descent to minimize the quadratic function $f(\mathbf{x}) = \frac{1}{2}(\mathbf{x} - \mathbf{x}^*)^T\mathbf{Q}(\mathbf{x} - \mathbf{x}^*)$ but we allow a tolerance $\pm\delta\bar{\alpha}_k$ ($\delta \ge 0$) in the line search, that is $\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k\mathbf{g}_k$, where

$$(1 - \delta)\bar{\alpha}_k \le \alpha_k \le (1 + \delta)\bar{\alpha}_k$$

and $\bar{\alpha}_k$ minimizes $f(\mathbf{x}_k - \alpha\mathbf{g}_k)$ over $\alpha$.

(a) Find the convergence rate of the algorithm in terms of $a$ and $A$, the smallest and largest eigenvalues of $\mathbf{Q}$, and the tolerance $\delta$.

Hint: Assume the extreme case $\alpha_k = (1 + \delta)\bar{\alpha}_k$.

(b) What is the largest $\delta$ that guarantees convergence of the algorithm? Explain this result geometrically.

(c) Does the sign of $\delta$ make any difference?

16. Show that for a quadratic objective function the percentage test and the Goldstein test are equivalent.

17. Suppose in the method of steepest descent for the quadratic problem, the value of $\alpha_k$ is not determined to minimize $E(\mathbf{x}_{k+1})$ exactly but instead only satisfies

$$\frac{E(\mathbf{x}_k) - E(\mathbf{x}_{k+1})}{E(\mathbf{x}_k)} \ge \beta\,\frac{E(\mathbf{x}_k) - \bar{E}}{E(\mathbf{x}_k)}$$

for some $\beta$, $0 < \beta < 1$, where $\bar{E}$ is the value that corresponds to the best $\alpha_k$. Find the best estimate for the rate of convergence in this case.

18. Suppose an iterative algorithm of the form $\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\mathbf{d}_k$ is applied to the quadratic problem with matrix $\mathbf{Q}$, where $\alpha_k$ as usual is chosen as the minimum point of the line search and where $\mathbf{d}_k$ is a vector satisfying $\mathbf{d}_k^T\mathbf{g}_k < 0$ and $(\mathbf{d}_k^T\mathbf{g}_k)^2 \ge \beta(\mathbf{d}_k^T\mathbf{Q}\mathbf{d}_k)(\mathbf{g}_k^T\mathbf{Q}^{-1}\mathbf{g}_k)$, where $0 < \beta \le 1$. This corresponds to a steepest descent algorithm with “sloppy” choice of direction. Estimate the rate of convergence of this algorithm.

19. Repeat Exercise 18 with the condition on $(\mathbf{d}_k^T\mathbf{g}_k)^2$ replaced by

$$(\mathbf{d}_k^T\mathbf{g}_k)^2 \ge \beta(\mathbf{d}_k^T\mathbf{d}_k)(\mathbf{g}_k^T\mathbf{g}_k), \quad 0 < \beta \le 1.$$

20. Use the result of Exercise 19 to derive (8.71) for the Gauss-Southwell method.

21. Let $f(x, y) = x^2 + y^2 + xy - 3x$.

(a) Find an unconstrained local minimum point of $f$.

(b) Why is the solution to (a) actually a global minimum point?

(c) Find the minimum point of $f$ subject to $x \ge 0$, $y \ge 0$.

(d) If the method of steepest descent were applied to (a), what would be the rate of convergence of the objective function?

22. Find an estimate for the rate of convergence for the modified Newton method

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k(\varepsilon_k\mathbf{I} + \mathbf{F}_k)^{-1}\mathbf{g}_k$$

given by (8.65) and (8.66) when $\delta$ is larger than the smallest eigenvalue of $\mathbf{F}(\mathbf{x}^*)$.

23.  Prove  global  convergence  of  the  Gauss-Southwell  method. 

24. Consider a problem of the form

$$\begin{aligned} \text{minimize}\quad & f(\mathbf{x}) \\ \text{subject to}\quad & \mathbf{x} \ge \mathbf{0}, \end{aligned}$$

where $\mathbf{x} \in E^n$. A gradient-type procedure has been suggested for this kind of problem that accounts for the constraint. At a given point $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, the direction $\mathbf{d} = (d_1, d_2, \ldots, d_n)$ is determined from the gradient $\nabla f(\mathbf{x})^T = \mathbf{g} = (g_1, g_2, \ldots, g_n)$ by

$$d_i = \begin{cases} -g_i & \text{if } x_i > 0 \text{ or } g_i < 0 \\ 0 & \text{if } x_i = 0 \text{ and } g_i \ge 0. \end{cases}$$

This direction is then used as a direction of search in the usual manner.

(a) What are the first-order necessary conditions for a minimum point of this problem?

(b) Show that $\mathbf{d}$, as determined by the algorithm, is zero only at a point satisfying the first-order conditions.

(c) Show that if $\mathbf{d} \ne \mathbf{0}$, it is possible to decrease the value of $f$ by movement along $\mathbf{d}$.

(d) If restricted to a compact region, does the Global Convergence Theorem apply? Why?

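25. Consider the quadratic problem and suppose $\mathbf{Q}$ has unity diagonal. Consider a coordinate descent procedure in which the coordinate to be searched is at every stage selected randomly, each coordinate being equally likely. Let $\boldsymbol{\varepsilon}_k = \mathbf{x}_k - \mathbf{x}^*$. Assuming $\boldsymbol{\varepsilon}_k$ is known, show that $\overline{\boldsymbol{\varepsilon}_{k+1}^T\mathbf{Q}\boldsymbol{\varepsilon}_{k+1}}$, the expected value of $\boldsymbol{\varepsilon}_{k+1}^T\mathbf{Q}\boldsymbol{\varepsilon}_{k+1}$, satisfies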


26.  If  the  matrix  Q  has  a  condition  number  of  10,  how  many  iterations  of  steepest 
descent  would  be  required  to  get  six  place  accuracy  in  the  minimum  value  of 
the  objective  function  of  the  corresponding  quadratic  problem? 

27. Stopping criterion. A question that arises in using an algorithm such as steepest descent to minimize an objective function $f$ is when to stop the iterative process, or, in other words, how can one tell when the current point is close to a solution. If, as with steepest descent, it is known that convergence is linear, this knowledge can be used to develop a stopping criterion. Let $\{f_k\}_{k=0}^{\infty}$ be the sequence of values obtained by the algorithm. We assume that $f_k \to f^*$ linearly, but both $f^*$ and the convergence ratio $\beta$ are unknown. However we know that, at least approximately,

$$f_{k+1} - f^* = \beta(f_k - f^*)$$

and

$$f_k - f^* = \beta(f_{k-1} - f^*).$$

These two equations can be solved for $\beta$ and $f^*$.

(a) Show that

$$f^* = \frac{f_k^2 - f_{k-1}f_{k+1}}{2f_k - f_{k-1} - f_{k+1}}, \qquad \beta = \frac{f_{k+1} - f_k}{f_k - f_{k-1}}.$$

(b) Motivated by the above we form the sequence $\{f_k^*\}$ defined by

$$f_k^* = \frac{f_k^2 - f_{k-1}f_{k+1}}{2f_k - f_{k-1} - f_{k+1}}$$

as the original sequence is generated. (This procedure of generating $\{f_k^*\}$ from $\{f_k\}$ is called the Aitken $\delta^2$-process.) If $|f_k - f^*| = \beta^k + o(\beta^k)$ show that $|f_k^* - f^*| = o(\beta^k)$, which means that $\{f_k^*\}$ converges to $f^*$ faster than $\{f_k\}$ does. The iterative search for the minimum of $f$ can then be terminated when $f_k - f_k^*$ is smaller than some prescribed tolerance.

28. Show that the concordant requirement (8.69) can be expressed as

$$\left|\frac{d}{dx} f''(x)^{-1/2}\right| \le 1.$$

29. Assume $f(x)$ and $g(x)$ are self-concordant. Show that the following functions are also self-concordant.

(a) $af(x)$ for $a \ge 1$

(b) $ax + b + f(x)$

(c) $f(ax + b)$

(d) $f(x) + g(x)$

1. Prove Lemma 1.

2. Consider convex quadratic minimization with matrix $\mathbf{Q}$, and let its distinct positive eigenvalues be $\lambda_1, \lambda_2, \ldots, \lambda_K$. Then, if we let the step size in the method of steepest descent be $\alpha_k = 1/\lambda_k$, $k = 1, \ldots, K$, the method terminates in $K$ iterations.

3. Show that the randomized coordinate descent method has the expected average convergence rate $\left(1 - \frac{a}{nA}\right)$ for solving strongly convex quadratic programs, where $a$ and $A$ are the smallest and largest eigenvalues of the Hessian matrix.

Chapter 9

Conjugate  Direction  Methods 


Conjugate  direction  methods  can  be  regarded  as  being  somewhat  intermediate 
between  the  method  of  steepest  descent  and  Newton’s  method.  They  are  motivated 
by  the  desire  to  accelerate  the  typically  slow  convergence  associated  with  steepest 
descent  while  avoiding  the  information  requirements  associated  with  the  evaluation, 
storage,  and  inversion  of  the  Hessian  (or  at  least  solution  of  a  corresponding  system 
of  equations)  as  required  by  Newton’s  method. 

Conjugate direction methods invariably are invented and analyzed for the purely quadratic problem

$$\text{minimize}\quad \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} - \mathbf{b}^T\mathbf{x},$$

where $\mathbf{Q}$ is an $n \times n$ symmetric positive definite matrix. The techniques once worked out for this problem are then extended, by approximation, to more general problems; it being argued that, since near the solution point every problem is approximately quadratic, convergence behavior is similar to that for the pure quadratic situation.

The area of conjugate direction algorithms has been one of great creativity
in the nonlinear programming field, illustrating that detailed analysis of the pure
quadratic problem can lead to significant practical advances. Indeed, conjugate
direction methods, especially the method of conjugate gradients, have proved to be
extremely effective in dealing with general objective functions and are considered
among the best general purpose methods.


9.1  Conjugate  Directions 


Definition. Given a symmetric matrix Q, two vectors $d_1$ and $d_2$ are said to be Q-orthogonal,
or conjugate with respect to Q, if $d_1^T Q d_2 = 0$.


© Springer International Publishing Switzerland 2016
D.G. Luenberger, Y. Ye, Linear and Nonlinear Programming, International
Series in Operations Research & Management Science 228,
DOI 10.1007/978-3-319-18842-3_9



In the applications that we consider, the matrix Q will be positive definite but this
is not inherent in the basic definition. Thus if Q = 0, any two vectors are conjugate,
while if Q = I, conjugacy is equivalent to the usual notion of orthogonality. A finite
set of vectors $d_0, d_1, \ldots, d_k$ is said to be a Q-orthogonal set if $d_i^T Q d_j = 0$ for all
$i \ne j$.

Proposition. If Q is positive definite and the set of nonzero vectors $d_0, d_1, d_2, \ldots, d_k$ are
Q-orthogonal, then these vectors are linearly independent.

Proof. Suppose there are constants $\alpha_i$, $i = 0, 1, 2, \ldots, k$ such that

$$\alpha_0 d_0 + \cdots + \alpha_k d_k = 0.$$

Multiplying by Q and taking the scalar product with $d_i$ yields

$$\alpha_i d_i^T Q d_i = 0.$$

Or, since $d_i^T Q d_i > 0$ in view of the positive definiteness of Q, we have $\alpha_i = 0$. ∎


Before discussing the general conjugate direction algorithm, let us investigate
just why the notion of Q-orthogonality is useful in the solution of the quadratic
problem

$$\text{minimize} \quad \tfrac{1}{2} x^T Q x - b^T x, \tag{9.1}$$

when Q is positive definite. Recall that the unique solution to this problem is also
the unique solution to the linear equation

$$Q x = b, \tag{9.2}$$

and hence that the quadratic minimization problem is equivalent to a linear equation
problem.

Corresponding to the n × n positive definite matrix Q let $d_0, d_1, \ldots, d_{n-1}$ be n
nonzero Q-orthogonal vectors. By the above proposition they are linearly independent,
which implies that the solution x* of (9.1) or (9.2) can be expanded in terms
of them as

$$x^* = \alpha_0 d_0 + \cdots + \alpha_{n-1} d_{n-1} \tag{9.3}$$

for some set of $\alpha_i$'s. In fact, multiplying by Q and then taking the scalar product
with $d_i$ yields directly

$$\alpha_i = \frac{d_i^T Q x^*}{d_i^T Q d_i} = \frac{d_i^T b}{d_i^T Q d_i}. \tag{9.4}$$

This shows that the $\alpha_i$'s and consequently the solution x* can be found by evaluation
of simple scalar products. The end result is

$$x^* = \sum_{i=0}^{n-1} \frac{d_i^T b}{d_i^T Q d_i}\, d_i. \tag{9.5}$$

There are two basic ideas imbedded in (9.5). The first is the idea of selecting
an orthogonal set of $d_i$'s so that by taking an appropriate scalar product, all terms
on the right side of (9.3), except the ith, vanish. This could, of course, have been
accomplished by making the $d_i$'s orthogonal in the ordinary sense instead of making
them Q-orthogonal. The second basic observation, however, is that by using
Q-orthogonality the resulting equation for $\alpha_i$ can be expressed in terms of the known
vector b rather than the unknown vector x*; hence the coefficients can be evaluated
without knowing x*.
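
As a minimal sketch of (9.5), the following Python snippet builds a Q-orthogonal set and sums the expansion. The way the directions are produced here (Gram-Schmidt conjugation of the unit vectors) is an assumption made purely for illustration; the text has not yet said how conjugate directions should be generated. The helper names `dot`, `matvec`, `conjugate_basis` and `solve_by_expansion`, and the 3 × 3 test data, are likewise hypothetical.

```python
def dot(u, v):
    """Ordinary scalar product of two vectors stored as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(Q, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in Q]


def conjugate_basis(Q):
    """Build n Q-orthogonal vectors by Gram-Schmidt conjugation of the
    unit vectors (an illustrative choice, not prescribed by the text)."""
    n = len(Q)
    dirs = []
    for j in range(n):
        e = [1.0 if i == j else 0.0 for i in range(n)]
        d = e[:]
        for p in dirs:
            Qp = matvec(Q, p)
            coef = dot(e, Qp) / dot(p, Qp)
            d = [di - coef * pi for di, pi in zip(d, p)]
        dirs.append(d)
    return dirs


def solve_by_expansion(Q, b):
    """Evaluate x* = sum_i (d_i^T b / d_i^T Q d_i) d_i, as in (9.5)."""
    x = [0.0] * len(b)
    for d in conjugate_basis(Q):
        alpha = dot(d, b) / dot(d, matvec(Q, d))  # (9.4): only b is needed
        x = [xi + alpha * di for xi, di in zip(x, d)]
    return x


# Hypothetical symmetric positive definite test problem.
Q = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]

x_star = solve_by_expansion(Q, b)
residual = [qi - bi for qi, bi in zip(matvec(Q, x_star), b)]
print(max(abs(r) for r in residual))  # should be at rounding-error level
```

Note that the coefficient computation never touches x*: each $\alpha_i$ is obtained from b alone, which is the second of the two ideas described above.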

The expansion for x* can be considered to be the result of an iterative process
of n steps where at the ith step $\alpha_i d_i$ is added. Viewing the procedure this way, and
allowing for an arbitrary initial point for the iteration, the basic conjugate direction
method is obtained.

Conjugate Direction Theorem. Let $\{d_i\}_{i=0}^{n-1}$ be a set of nonzero Q-orthogonal vectors. For
any $x_0 \in E^n$ the sequence $\{x_k\}$ generated according to

$$x_{k+1} = x_k + \alpha_k d_k, \quad k \ge 0 \tag{9.6}$$

with

$$\alpha_k = -\frac{g_k^T d_k}{d_k^T Q d_k} \tag{9.7}$$

and

$$g_k = Q x_k - b,$$

converges to the unique solution, x*, of Qx = b after n steps, that is, $x_n = x^*$.


Proof. Since the $d_k$'s are linearly independent, we can write

$$x^* - x_0 = \alpha_0 d_0 + \alpha_1 d_1 + \cdots + \alpha_{n-1} d_{n-1}$$

for some set of $\alpha_k$'s. As we did to get (9.4), we multiply by Q and take the scalar
product with $d_k$ to find

$$\alpha_k = \frac{d_k^T Q (x^* - x_0)}{d_k^T Q d_k}. \tag{9.8}$$

Now following the iterative process (9.6) from $x_0$ up to $x_k$ gives

$$x_k - x_0 = \alpha_0 d_0 + \alpha_1 d_1 + \cdots + \alpha_{k-1} d_{k-1}, \tag{9.9}$$

and hence by the Q-orthogonality of the $d_k$'s it follows that

$$d_k^T Q (x_k - x_0) = 0. \tag{9.10}$$

Substituting (9.10) into (9.8) produces

$$\alpha_k = \frac{d_k^T Q (x^* - x_k)}{d_k^T Q d_k} = -\frac{g_k^T d_k}{d_k^T Q d_k},$$

which is identical with (9.7). ∎

To this point the conjugate direction method has been derived essentially through
the observation that solving (9.1) is equivalent to solving (9.2). The conjugate
direction method has been viewed simply as a somewhat special, but nevertheless
straightforward, orthogonal expansion for the solution to (9.2). This viewpoint,
although important because of its underlying simplicity, ignores some of the most
important aspects of the algorithm; especially those aspects that are important when
extending the method to nonquadratic problems. These additional properties are
discussed in the next section.

Also, methods for selecting or generating sequences of conjugate directions have
not yet been presented. Some methods for doing this are discussed in the exercises;
while the most important method, that of conjugate gradients, is discussed
in Sect. 9.3.
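
The iteration (9.6)-(9.7) of the Conjugate Direction Theorem can be sketched in a few lines of Python. The 2 × 2 data, the hand-picked Q-orthogonal pair `d0`, `d1`, and the function names are assumptions chosen for illustration; any other nonzero Q-orthogonal set would serve equally well.

```python
def dot(u, v):
    """Ordinary scalar product of two vectors stored as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(Q, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in Q]


def conjugate_direction_method(Q, b, x0, directions):
    """Apply x_{k+1} = x_k + alpha_k d_k with alpha_k taken from (9.7)."""
    x = list(x0)
    for d in directions:
        g = [qi - bi for qi, bi in zip(matvec(Q, x), b)]  # g_k = Q x_k - b
        alpha = -dot(g, d) / dot(d, matvec(Q, d))         # (9.7)
        x = [xi + alpha * di for xi, di in zip(x, d)]     # (9.6)
    return x


# Hypothetical data: the solution of Qx = b is (1, 1).
Q = [[2.0, 1.0],
     [1.0, 2.0]]
b = [3.0, 3.0]
d0, d1 = [1.0, 0.0], [1.0, -2.0]
assert dot(d0, matvec(Q, d1)) == 0.0   # d0 and d1 are Q-orthogonal

x_n = conjugate_direction_method(Q, b, [4.0, -7.0], [d0, d1])
print(x_n)  # [1.0, 1.0]
```

The starting point is arbitrary; starting from the origin instead of (4, -7) reaches the same point after the same two steps, as the theorem asserts.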


9.2  Descent  Properties  of  the  Conjugate  Direction  Method 

We define $\mathcal{B}_k$ as the subspace of $E^n$ spanned by $\{d_0, d_1, \ldots, d_{k-1}\}$. We shall show
that as the method of conjugate directions progresses each $x_k$ minimizes the objective
over the k-dimensional linear variety $x_0 + \mathcal{B}_k$.

Expanding Subspace Theorem. Let $\{d_i\}_{i=0}^{n-1}$ be a sequence of nonzero Q-orthogonal vectors
in $E^n$. Then for any $x_0 \in E^n$ the sequence $\{x_k\}$ generated according to

$$x_{k+1} = x_k + \alpha_k d_k \tag{9.11}$$

$$\alpha_k = -\frac{g_k^T d_k}{d_k^T Q d_k} \tag{9.12}$$

has the property that $x_k$ minimizes $f(x) = \tfrac{1}{2} x^T Q x - b^T x$ on the line
$x = x_{k-1} + \alpha d_{k-1}$, $-\infty < \alpha < \infty$, as well as on the linear variety $x_0 + \mathcal{B}_k$.

Proof. It need only be shown that $x_k$ minimizes f on the linear variety $x_0 + \mathcal{B}_k$,
since it contains the line $x = x_{k-1} + \alpha d_{k-1}$. Since f is a strictly convex function,
the conclusion will hold if it can be shown that $g_k$ is orthogonal to $\mathcal{B}_k$ (that is, the
gradient of f at $x_k$ is orthogonal to the subspace $\mathcal{B}_k$). The situation is illustrated in
Fig. 9.1. (Compare Theorem 2, Sect. 7.5.)

We prove $g_k \perp \mathcal{B}_k$ by induction. Since $\mathcal{B}_0$ is empty that hypothesis is true for
k = 0. Assuming that it is true for k, that is, assuming $g_k \perp \mathcal{B}_k$, we show that
$g_{k+1} \perp \mathcal{B}_{k+1}$. We have

$$g_{k+1} = g_k + \alpha_k Q d_k, \tag{9.13}$$

and hence

$$d_k^T g_{k+1} = d_k^T g_k + \alpha_k d_k^T Q d_k = 0 \tag{9.14}$$

by definition of $\alpha_k$. Also for i < k

$$d_i^T g_{k+1} = d_i^T g_k + \alpha_k d_i^T Q d_k. \tag{9.15}$$

The first term on the right-hand side of (9.15) vanishes because of the induction
hypothesis, while the second vanishes by the Q-orthogonality of the $d_i$'s. Thus
$g_{k+1} \perp \mathcal{B}_{k+1}$. ∎


Fig.  9.1  Conjugate  direction  method 

Corollary. In the method of conjugate directions the gradients $g_k$, $k = 0, 1, \ldots, n$ satisfy

$$g_k^T d_i = 0 \quad \text{for } i < k.$$

The above theorem is referred to as the Expanding Subspace Theorem, since the
$\mathcal{B}_k$'s form a sequence of subspaces with $\mathcal{B}_{k+1} \supset \mathcal{B}_k$. Since $x_k$ minimizes f over
$x_0 + \mathcal{B}_k$, it is clear that $x_n$ must be the overall minimum of f.


Fig. 9.2 Interpretation of expanding subspace theorem

To obtain another interpretation of this result we again introduce the function

$$E(x) = \tfrac{1}{2} (x - x^*)^T Q (x - x^*) \tag{9.16}$$

as a measure of how close the vector x is to the solution x*. Since
$E(x) = f(x) + \tfrac{1}{2} x^{*T} Q x^*$ the function E can be regarded as the objective that we seek
to minimize.
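
The identity relating E and f is a short expansion, written out here for completeness. It uses only the symmetry of Q and the condition Qx* = b from (9.2):

```latex
\begin{aligned}
E(x) &= \tfrac{1}{2}(x - x^*)^T Q (x - x^*) \\
     &= \tfrac{1}{2} x^T Q x - x^T Q x^* + \tfrac{1}{2} x^{*T} Q x^* \\
     &= \tfrac{1}{2} x^T Q x - b^T x + \tfrac{1}{2} x^{*T} Q x^*
      = f(x) + \tfrac{1}{2} x^{*T} Q x^* .
\end{aligned}
```

Because the last term does not depend on x, E and f differ by a constant and have the same minimizer over any set.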

By considering the minimization of E we can regard the original problem as one
of minimizing a generalized distance from the point x*. Indeed, if we had Q = I,
the generalized notion of distance would correspond (within a factor of two) to the
usual Euclidean distance. For an arbitrary positive-definite Q we say E is a generalized
Euclidean metric or distance function. Vectors $d_i$, $i = 0, 1, \ldots, n-1$ that are
Q-orthogonal may be regarded as orthogonal in this generalized Euclidean space
and this leads to the simple interpretation of the Expanding Subspace Theorem
illustrated in Fig. 9.2. For simplicity we assume $x_0 = 0$. In the figure $d_k$ is shown as
being orthogonal to $\mathcal{B}_k$ with respect to the generalized metric. The point $x_k$ minimizes
E over $\mathcal{B}_k$ while $x_{k+1}$ minimizes E over $\mathcal{B}_{k+1}$. The basic property is that, since
$d_k$ is orthogonal to $\mathcal{B}_k$, the point $x_{k+1}$ can be found by minimizing E along $d_k$ and
adding the result to $x_k$.


9.3  The  Conjugate  Gradient  Method 

The  conjugate  gradient  method  is  the  conjugate  direction  method  that  is  obtained  by 
selecting  the  successive  direction  vectors  as  a  conjugate  version  of  the  successive 
gradients  obtained  as  the  method  progresses.  Thus,  the  directions  are  not  specified 
beforehand,  but  rather  are  determined  sequentially  at  each  step  of  the  iteration.  At 
step k one evaluates the current negative gradient vector and adds to it a linear combination
of the previous direction vectors to obtain a new conjugate direction vector
along  which  to  move. 

There are three primary advantages to this method of direction selection. First,
unless the solution is attained in less than n steps, the gradient is always nonzero
and linearly independent of all previous direction vectors. Indeed, the gradient $g_k$
is orthogonal to the subspace $\mathcal{B}_k$ generated by $d_0, d_1, \ldots, d_{k-1}$. If the solution is
reached before n steps are taken, the gradient vanishes and the process terminates,
it being unnecessary, in this case, to find additional directions.

Second,  a  more  important  advantage  of  the  conjugate  gradient  method  is  the 
especially  simple  formula  that  is  used  to  determine  the  new  direction  vector.  This 
simplicity  makes  the  method  only  slightly  more  complicated  than  steepest  descent. 

Third,  because  the  directions  are  based  on  the  gradients,  the  process  makes  good 
uniform  progress  toward  the  solution  at  every  step.  This  is  in  contrast  to  the  situation 
for  arbitrary  sequences  of  conjugate  directions  in  which  progress  may  be  slight  until 
the  final  few  steps.  Although  for  the  pure  quadratic  problem  uniform  progress  is  of 
no  great  importance,  it  is  important  for  generalizations  to  nonquadratic  problems. 


Conjugate  Gradient  Algorithm 


Starting at any $x_0 \in E^n$ define $d_0 = -g_0 = b - Q x_0$ and

$$x_{k+1} = x_k + \alpha_k d_k \tag{9.17}$$

$$\alpha_k = -\frac{g_k^T d_k}{d_k^T Q d_k} \tag{9.18}$$

$$d_{k+1} = -g_{k+1} + \beta_k d_k \tag{9.19}$$

$$\beta_k = \frac{g_{k+1}^T Q d_k}{d_k^T Q d_k}, \tag{9.20}$$

where $g_k = Q x_k - b$.

In the algorithm the first step is identical to a steepest descent step; each succeeding
step moves in a direction that is a linear combination of the current gradient and
the preceding direction vector. The attractive feature of the algorithm is the simple
formulae, (9.19) and (9.20), for updating the direction vector. The method is only
slightly more complicated to implement than the method of steepest descent but
converges in a finite number of steps.
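
A minimal Python sketch of the recursion (9.17)-(9.20) follows. The function name `conjugate_gradient`, the early-exit tolerance `tol`, and the test data are assumptions added for illustration; the text's algorithm itself has no stopping test other than reaching n steps.

```python
def dot(u, v):
    """Ordinary scalar product of two vectors stored as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(Q, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in Q]


def conjugate_gradient(Q, b, x0, tol=1e-12):
    """Conjugate gradient iteration (9.17)-(9.20) for Qx = b.

    Runs at most n steps and stops early once the gradient norm
    falls below `tol` (the tolerance is an added safeguard).
    """
    x = list(x0)
    g = [qi - bi for qi, bi in zip(matvec(Q, x), b)]  # g_0 = Q x_0 - b
    d = [-gi for gi in g]                              # d_0 = -g_0
    for _ in range(len(b)):
        if dot(g, g) ** 0.5 <= tol:
            break
        Qd = matvec(Q, d)
        dQd = dot(d, Qd)
        alpha = -dot(g, d) / dQd                           # (9.18)
        x = [xi + alpha * di for xi, di in zip(x, d)]      # (9.17)
        g = [qi - bi for qi, bi in zip(matvec(Q, x), b)]   # g_{k+1}
        beta = dot(g, Qd) / dQd                            # (9.20)
        d = [-gi + beta * di for gi, di in zip(g, d)]      # (9.19)
    return x


# Hypothetical symmetric positive definite test problem.
Q = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]

x = conjugate_gradient(Q, b, [0.0, 0.0, 0.0])
print(max(abs(qi - bi) for qi, bi in zip(matvec(Q, x), b)))  # near zero
```

Only the current point, gradient, and direction are stored, and each step costs one product with Q beyond the gradient evaluation, which is why the method is only slightly more work per step than steepest descent.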


Verification  of  the  Algorithm 

To verify that the algorithm is a conjugate direction algorithm, it is necessary to
verify that the vectors $\{d_k\}$ are Q-orthogonal. It is easiest to prove this by simultaneously
proving a number of other properties of the algorithm. This is done in the
theorem below where the notation $[d_0, d_1, \ldots, d_k]$ is used to denote the subspace
spanned by the vectors $d_0, d_1, \ldots, d_k$.

Conjugate Gradient Theorem. The conjugate gradient algorithm (9.17)-(9.20) is a conjugate
direction method. If it does not terminate at $x_k$, then

a) $[g_0, g_1, \ldots, g_k] = [g_0, Q g_0, \ldots, Q^k g_0]$

b) $[d_0, d_1, \ldots, d_k] = [g_0, Q g_0, \ldots, Q^k g_0]$

c) $d_k^T Q d_i = 0$ for $i \le k - 1$

d) $\alpha_k = g_k^T g_k / d_k^T Q d_k$

e) $\beta_k = g_{k+1}^T g_{k+1} / g_k^T g_k$.

Proof. We first prove (a), (b) and (c) simultaneously by induction. Clearly, they are
true for k = 0. Now suppose they are true for k, we show that they are true for k + 1.
We have

$$g_{k+1} = g_k + \alpha_k Q d_k.$$

By the induction hypothesis both $g_k$ and $Q d_k$ belong to $[g_0, Q g_0, \ldots, Q^{k+1} g_0]$, the
first by (a) and the second by (b). Thus $g_{k+1} \in [g_0, Q g_0, \ldots, Q^{k+1} g_0]$. Furthermore
$g_{k+1} \notin [g_0, Q g_0, \ldots, Q^k g_0] = [d_0, d_1, \ldots, d_k]$ since otherwise $g_{k+1} = 0$,
because for any conjugate direction method $g_{k+1}$ is orthogonal to $[d_0, d_1, \ldots, d_k]$.
(The induction hypothesis on (c) guarantees that the method is a conjugate direction
method up to $x_{k+1}$.) Thus, finally we conclude that

$$[g_0, g_1, \ldots, g_{k+1}] = [g_0, Q g_0, \ldots, Q^{k+1} g_0],$$

which proves (a).

To prove (b) we write

$$d_{k+1} = -g_{k+1} + \beta_k d_k,$$

and (b) immediately follows from (a) and the induction hypothesis on (b).

Next, to prove (c) we have

$$d_{k+1}^T Q d_i = -g_{k+1}^T Q d_i + \beta_k d_k^T Q d_i.$$

For i = k the right side is zero by definition of $\beta_k$. For i < k both terms vanish.
The first term vanishes since $Q d_i \in [d_0, d_1, \ldots, d_{i+1}]$, the induction hypothesis
which guarantees the method is a conjugate direction method up to $x_{k+1}$, and
by the Expanding Subspace Theorem that guarantees that $g_{k+1}$ is orthogonal to
$[d_0, d_1, \ldots, d_{i+1}]$. The second term vanishes by the induction hypothesis on (c).
This proves (c), which also proves that the method is a conjugate direction method.
To prove (d) we have

$$-g_k^T d_k = g_k^T g_k - \beta_{k-1} g_k^T d_{k-1},$$

and the second term is zero by the Expanding Subspace Theorem.

Finally, to prove (e) we note that $g_{k+1}^T g_k = 0$, because $g_k \in [d_0, \ldots, d_k]$ and $g_{k+1}$
is orthogonal to $[d_0, \ldots, d_k]$. Thus since

$$Q d_k = \frac{1}{\alpha_k} (g_{k+1} - g_k),$$

we have

$$g_{k+1}^T Q d_k = \frac{1}{\alpha_k} g_{k+1}^T g_{k+1}. \qquad ∎$$

Parts (a) and (b) of this theorem are a formal statement of the interrelation between
the direction vectors and the gradient vectors. Part (c) is the equation that verifies
that the method is a conjugate direction method. Parts (d) and (e) are identities
yielding alternative formulae for $\alpha_k$ and $\beta_k$ that are often more convenient than the
original ones.
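
The identities in parts (c), (d) and (e) can be checked numerically on a small example. The sketch below runs the algorithm with the original formulae (9.18) and (9.20) and records how far the alternative expressions deviate; the test matrix and the names `alpha_gap`, `beta_gap`, `conj_gap` and `orth_gap` are hypothetical and serve only this illustration.

```python
def dot(u, v):
    """Ordinary scalar product of two vectors stored as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(Q, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in Q]


# Hypothetical symmetric positive definite test problem.
Q = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
n = len(b)

x = [0.0] * n
g = [qi - bi for qi, bi in zip(matvec(Q, x), b)]
d = [-gi for gi in g]
dirs, grads = [], []
alpha_gap = beta_gap = 0.0

for k in range(n):
    dirs.append(d)
    grads.append(g)
    Qd = matvec(Q, d)
    dQd = dot(d, Qd)
    alpha = -dot(g, d) / dQd                                   # (9.18)
    alpha_gap = max(alpha_gap, abs(alpha - dot(g, g) / dQd))   # part (d)
    x = [xi + alpha * di for xi, di in zip(x, d)]
    g_next = [gi + alpha * qi for gi, qi in zip(g, Qd)]        # (9.13)
    beta = dot(g_next, Qd) / dQd                               # (9.20)
    beta_gap = max(beta_gap,
                   abs(beta - dot(g_next, g_next) / dot(g, g)))  # part (e)
    d = [-gi + beta * di for gi, di in zip(g_next, d)]
    g = g_next

# Part (c): directions are mutually Q-orthogonal.
conj_gap = max(abs(dot(dirs[i], matvec(Q, dirs[j])))
               for i in range(n) for j in range(i))
# Successive gradients are mutually orthogonal, as used in the proof of (e).
orth_gap = max(abs(dot(grads[i], grads[j]))
               for i in range(n) for j in range(i))
print(alpha_gap, beta_gap, conj_gap, orth_gap)  # all at rounding-error level
```

In practice the forms (d) and (e) are attractive because they reuse the scalar $g_k^T g_k$ and avoid the extra product $g_{k+1}^T Q d_k$.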


9.4  The  C-G  Method  as  an  Optimal  Process 

We turn now to the description of a special viewpoint that leads quickly to some
very profound convergence results for the method of conjugate gradients. The basis
of the viewpoint is part (b) of the Conjugate Gradient Theorem. This result tells us
the spaces $\mathcal{B}_k$ over which we successively minimize are determined by the original
gradient $g_0$ and multiplications of it by Q. Each step of the method brings into
consideration an additional power of Q times $g_0$. It is this observation we exploit.

Let us consider a new general approach for solving the quadratic minimization
problem. Given an arbitrary starting point $x_0$, let

$$x_{k+1} = x_0 + P_k(Q) g_0, \tag{9.21}$$

where $P_k$ is a polynomial of degree k. Selection of a set of coefficients for each of
the polynomials $P_k$ determines a sequence of $x_k$'s. We have

$$x_{k+1} - x^* = x_0 - x^* + P_k(Q) Q (x_0 - x^*) = [I + Q P_k(Q)](x_0 - x^*), \tag{9.22}$$

and hence

$$E(x_{k+1}) = \tfrac{1}{2} (x_{k+1} - x^*)^T Q (x_{k+1} - x^*)
           = \tfrac{1}{2} (x_0 - x^*)^T Q [I + Q P_k(Q)]^2 (x_0 - x^*). \tag{9.23}$$

We may now pose the problem of selecting the polynomial $P_k$ in such a way as to
minimize $E(x_{k+1})$ with respect to all possible polynomials of degree k. Expanding
(9.21), however, we obtain

$$x_{k+1} = x_0 + \gamma_0 g_0 + \gamma_1 Q g_0 + \cdots + \gamma_k Q^k g_0, \tag{9.24}$$

where the $\gamma_i$'s are the coefficients of $P_k$. In view of

$$\mathcal{B}_{k+1} = [d_0, d_1, \ldots, d_k] = [g_0, Q g_0, \ldots, Q^k g_0],$$

the vector $x_{k+1} = x_0 + \alpha_0 d_0 + \alpha_1 d_1 + \cdots + \alpha_k d_k$ generated by the method of conjugate
gradients has precisely this form; moreover, according to the Expanding Subspace
Theorem, the coefficients $\gamma_i$ determined by the conjugate gradient process are such
as to minimize $E(x_{k+1})$. Therefore, the problem posed of selecting the optimal $P_k$ is
solved by the conjugate gradient procedure.

The explicit relation between the optimal coefficients $\gamma_i$ of $P_k$ and the constants
$\alpha_i$, $\beta_i$ associated with the conjugate gradient method is, of course, somewhat
complicated, as is the relation between the coefficients of $P_k$ and those of $P_{k+1}$. The
power of the conjugate gradient method is that as it progresses it successively solves
each of the optimal polynomial problems while updating only a small amount of
information.

We summarize the above development by the following very useful theorem.

Theorem 1. The point $x_{k+1}$ generated by the conjugate gradient method satisfies

$$E(x_{k+1}) = \min_{P_k} \tfrac{1}{2} (x_0 - x^*)^T Q [I + Q P_k(Q)]^2 (x_0 - x^*), \tag{9.25}$$

where the minimum is taken with respect to all polynomials $P_k$ of degree k.


Bounds  on  Convergence 

To use Theorem 1 most effectively it is convenient to recast it in terms of eigenvectors
and eigenvalues of the matrix Q. Suppose that the vector $x_0 - x^*$ is written in
the eigenvector expansion

$$x_0 - x^* = \xi_1 e_1 + \xi_2 e_2 + \cdots + \xi_n e_n,$$

where the $e_i$'s are normalized eigenvectors of Q. Then since
$Q(x_0 - x^*) = \lambda_1 \xi_1 e_1 + \lambda_2 \xi_2 e_2 + \cdots + \lambda_n \xi_n e_n$ and since the eigenvectors are
mutually orthogonal, we have

$$E(x_0) = \tfrac{1}{2} (x_0 - x^*)^T Q (x_0 - x^*) = \tfrac{1}{2} \sum_{i=1}^{n} \lambda_i \xi_i^2, \tag{9.26}$$

where the $\lambda_i$'s are the corresponding eigenvalues of Q. Applying the same manipulations
to (9.25), we find that for any polynomial $P_k$ of degree k there holds

$$E(x_{k+1}) \le \tfrac{1}{2} \sum_{i=1}^{n} [1 + \lambda_i P_k(\lambda_i)]^2 \lambda_i \xi_i^2.$$

It then follows that

$$E(x_{k+1}) \le \max_{\lambda_i} [1 + \lambda_i P_k(\lambda_i)]^2 \; \tfrac{1}{2} \sum_{i=1}^{n} \lambda_i \xi_i^2,$$

and hence finally

$$E(x_{k+1}) \le \max_{\lambda_i} [1 + \lambda_i P_k(\lambda_i)]^2 E(x_0).$$

We summarize this result by the following theorem.

Theorem 2. In the method of conjugate gradients we have

$$E(x_{k+1}) \le \max_{\lambda_i} [1 + \lambda_i P_k(\lambda_i)]^2 E(x_0) \tag{9.27}$$

for any polynomial $P_k$ of degree k, where the maximum is taken over all eigenvalues $\lambda_i$
of Q.
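
As a small worked instance of how (9.27) can be used (an illustration added here, with a and A denoting the smallest and largest eigenvalues of Q), take k = 0 and the constant polynomial $P_0(\lambda) = -2/(a + A)$:

```latex
1 + \lambda P_0(\lambda) = 1 - \frac{2\lambda}{a + A},
\qquad
\max_{a \le \lambda \le A} \left| 1 - \frac{2\lambda}{a + A} \right| = \frac{A - a}{A + a},
\qquad\text{so}\qquad
E(x_1) \le \left( \frac{A - a}{A + a} \right)^2 E(x_0).
```

This is the familiar steepest descent bound, which is consistent with the first conjugate gradient step being a steepest descent step.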

This way of viewing the conjugate gradient method as an optimal process is exploited
in the next section. We note here that it implies the far from obvious fact that
every step of the conjugate gradient method is at least as good as a steepest descent
step would be from the same point. To see this, suppose $x_k$ has been computed by
the conjugate gradient method. From (9.24) we know $x_k$ has the form

$$x_k = x_0 + \gamma_0 g_0 + \gamma_1 Q g_0 + \cdots + \gamma_{k-1} Q^{k-1} g_0.$$

Now if $x_{k+1}$ is computed from $x_k$ by steepest descent, then $x_{k+1} = x_k - \alpha_k g_k$ for some
$\alpha_k$. In view of part (a) of the Conjugate Gradient Theorem $x_{k+1}$ will have the form
(9.24). Since for the conjugate direction method $E(x_{k+1})$ is no larger than the value of E
at any other $x_{k+1}$ of the form (9.24), we obtain the desired conclusion.

Typically when some information about the eigenvalue structure of Q is known,
that information can be exploited by construction of a suitable polynomial $P_k$ to use
in (9.27). Suppose, for example, it were known that Q had only m < n distinct eigenvalues.
Then it is clear that by suitable choice of $P_{m-1}$ it would be possible to make
the mth degree polynomial $1 + \lambda P_{m-1}(\lambda)$ have its m zeros at the m eigenvalues. Using
that particular polynomial in (9.27) shows that $E(x_m) = 0$. Thus the optimal solution
will be obtained in at most m, rather than n, steps. More sophisticated examples of
this type of reasoning are contained in the next section and in the exercises at the
end of the chapter.
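
The termination claim for m distinct eigenvalues is easy to observe numerically. The sketch below is an illustration under stated assumptions: the diagonal 6 × 6 matrix with eigenvalues 1, 4 and 9 (each repeated twice), the right-hand side, the tolerance, and the helper name `cg_steps` are all hypothetical. The step formulae are those of parts (d) and (e) of the Conjugate Gradient Theorem.

```python
def dot(u, v):
    """Ordinary scalar product of two vectors stored as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(Q, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in Q]


def cg_steps(Q, b, x0, tol=1e-10):
    """Run conjugate gradients; return (x, number of steps taken)."""
    x = list(x0)
    g = [qi - bi for qi, bi in zip(matvec(Q, x), b)]
    d = [-gi for gi in g]
    steps = 0
    while dot(g, g) ** 0.5 > tol and steps < len(b):
        Qd = matvec(Q, d)
        alpha = dot(g, g) / dot(d, Qd)                     # part (d)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g_next = [gi + alpha * qi for gi, qi in zip(g, Qd)]
        beta = dot(g_next, g_next) / dot(g, g)             # part (e)
        d = [-gi + beta * di for gi, di in zip(g_next, d)]
        g = g_next
        steps += 1
    return x, steps


# n = 6 but only m = 3 distinct eigenvalues.
eigs = [1.0, 1.0, 4.0, 4.0, 9.0, 9.0]
Q = [[eigs[i] if i == j else 0.0 for j in range(6)] for i in range(6)]
b = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

x, steps = cg_steps(Q, b, [0.0] * 6)
print(steps)  # 3, the number of distinct eigenvalues
```

A diagonal matrix is used only so that the eigenvalues can be read off directly; the argument in the text depends on the eigenvalues alone, not on the eigenvectors.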


9.5  The  Partial  Conjugate  Gradient  Method 


A collection of procedures that are natural to consider at this point are those in
which the conjugate gradient procedure is carried out for m + 1 < n steps and then,
rather than continuing, the process is restarted from the current point and m + 1
more conjugate gradient steps are taken. The special case of m = 0 corresponds
to the standard method of steepest descent, while m = n − 1 corresponds to the
full conjugate gradient method. These partial conjugate gradient methods are of
extreme theoretical and practical importance, and their analysis yields additional
insight into the method of conjugate gradients. The development of the last section
forms the basis of our analysis.

As before, given the problem

$$\text{minimize} \quad \tfrac{1}{2} x^T Q x - b^T x, \tag{9.28}$$

we define for any point $x_k$ the gradient $g_k = Q x_k - b$. We consider an iteration scheme
of the form

$$x_{k+1} = x_k + P^k(Q) g_k, \tag{9.29}$$

where $P^k$ is a polynomial of degree m. We select the coefficients of the polynomial
$P^k$ so as to minimize

$$E(x_{k+1}) = \tfrac{1}{2} (x_{k+1} - x^*)^T Q (x_{k+1} - x^*), \tag{9.30}$$

where x* is the solution to (9.28). In view of the development of the last section, it
is clear that $x_{k+1}$ can be found by taking m + 1 conjugate gradient steps rather than
explicitly determining the appropriate polynomial directly. (The sequence indexing
is slightly different here than in the previous section, since now we do not give
separate indices to the intermediate steps of this process. Going from $x_k$ to $x_{k+1}$ by
the partial conjugate gradient method involves m other points.)

The results of the previous section provide a tool for convergence analysis of
this method. In this case, however, we develop a result that is of particular interest
for Q's having a special eigenvalue structure that occurs frequently in optimization
problems, especially, as shown below and in Chap. 12, in the context of penalty
function methods for solving problems with constraints. We imagine that the eigenvalues
of Q are of two kinds: there are m large eigenvalues that may or may not be
located near each other, and n − m smaller eigenvalues located within an interval
[a, b]. Such a distribution of eigenvalues is shown in Fig. 9.3.

Fig. 9.3 Eigenvalue distribution (n − m eigenvalues in the interval [a, b]; m large eigenvalues to its right)

As an example, consider as in Sect. 8.3 the problem on $E^n$

$$\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2} x^T Q x - b^T x \\
\text{subject to} \quad & c^T x = 0,
\end{aligned}$$

where Q is a symmetric positive definite matrix with eigenvalues in the interval
[a, A] and b and c are vectors in $E^n$. This is a constrained problem but it can be
approximated by the unconstrained problem

$$\text{minimize} \quad \tfrac{1}{2} x^T Q x - b^T x + \tfrac{1}{2} \mu (c^T x)^2,$$

where $\mu$ is a large positive constant. The last term in the objective function is called
a penalty term; for large $\mu$ minimization with respect to x will tend to make $c^T x$
small.

The total quadratic term in the objective is $\tfrac{1}{2} x^T (Q + \mu c c^T) x$, and thus it is appropriate
to consider the eigenvalues of the matrix $Q + \mu c c^T$. As $\mu$ tends to infinity it
can be shown (see Chap. 13) that one eigenvalue of this matrix tends to infinity and
the other n − 1 eigenvalues remain bounded within the original interval [a, A].

As noted before, if steepest descent were applied to a problem with such a structure,
convergence would be governed by the ratio of the smallest to largest eigenvalue,
which in this case would be quite unfavorable. In the theorem below it is
stated that by successively repeating m + 1 conjugate gradient steps the effects of the
m largest eigenvalues are eliminated and the rate of convergence is determined as if
they were not present. A computational example of this phenomenon is presented in
Sect. 13.5. The reader may find it interesting to read that section right after this one.

Theorem (Partial Conjugate Gradient Method). Suppose the symmetric positive definite
matrix Q has n − m eigenvalues in the interval [a, b], a > 0 and the remaining m eigenvalues
are greater than b. Then the method of partial conjugate gradients, restarted every m + 1
steps, satisfies

$$E(x_{k+1}) \le \left( \frac{b - a}{b + a} \right)^2 E(x_k). \tag{9.31}$$

(The point $x_{k+1}$ is found from $x_k$ by taking m + 1 conjugate gradient steps so that each
increment in k is a composite of several simple steps.)


Proof. Application of (9.27) yields

$$E(x_{k+1}) \le \max_{\lambda_i} [1 + \lambda_i P(\lambda_i)]^2 E(x_k) \tag{9.32}$$

for any mth-order polynomial P, where the $\lambda_i$'s are the eigenvalues of Q. Let us
select P so that the (m + 1)th-degree polynomial $q(\lambda) = 1 + \lambda P(\lambda)$ vanishes at
(a + b)/2 and at the m large eigenvalues of Q. This is illustrated in Fig. 9.4. For this
choice of P we may write (9.32) as

$$E(x_{k+1}) \le \max_{a \le \lambda_i \le b} [1 + \lambda_i P(\lambda_i)]^2 E(x_k).$$

Since the polynomial $q(\lambda) = 1 + \lambda P(\lambda)$ has m + 1 real roots, $q'(\lambda)$ will have m real
roots which alternate between the roots of $q(\lambda)$ on the real axis. Likewise, $q''(\lambda)$
will have m − 1 real roots which alternate between the roots of $q'(\lambda)$. Thus, since
$q(\lambda)$ has no root in the interval $(-\infty, (a + b)/2)$, we see that $q''(\lambda)$ does not change
sign in that interval; and since it is easily verified that $q''(0) > 0$ it follows that $q(\lambda)$
is convex for $\lambda < (a + b)/2$. Therefore, on [0, (a + b)/2], $q(\lambda)$ lies below the line
$1 - [2\lambda/(a + b)]$. Thus we conclude that

$$q(\lambda) \le 1 - \frac{2\lambda}{a + b}$$

on [0, (a + b)/2] and that

$$q'\!\left( \frac{a + b}{2} \right) \ge -\frac{2}{a + b}.$$

We can see that on [(a + b)/2, b]

$$q(\lambda) \ge 1 - \frac{2\lambda}{a + b},$$

since for $q(\lambda)$ to cross first the line $1 - [2\lambda/(a + b)]$ and then the $\lambda$-axis would require
at least two changes in sign of $q''(\lambda)$, whereas, at most one root of $q''(\lambda)$ exists to
the left of the second root of $q(\lambda)$. We see then that the inequality

$$|1 + \lambda P(\lambda)| \le \left| 1 - \frac{2\lambda}{a + b} \right|$$

is valid on the interval [a, b]. The final result (9.31) follows immediately. ∎

Fig. 9.4 Construction for proof

In view of this theorem, the method of partial conjugate gradients can be regarded
as a generalization of steepest descent, not only in its philosophy and implementation,
but also in its behavior. Its rate of convergence is bounded by exactly the same
formula as that of steepest descent but with the largest eigenvalues removed from
consideration. (It is worth noting that for m = 0 the above proof provides a simple
derivation of the Steepest Descent Theorem.)
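
The bound (9.31) can be observed on a small synthetic problem. In the sketch below, everything concrete is an assumption chosen for illustration: a diagonal Q with six eigenvalues in [1, 2] and two large eigenvalues (so m = 2), a right-hand side of ones, and the helper names `partial_cg_cycle` and `error`. The right-hand side is called `rhs` to avoid a clash with the interval endpoint b.

```python
def dot(u, v):
    """Ordinary scalar product of two vectors stored as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(Q, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in Q]


def partial_cg_cycle(Q, rhs, x, m):
    """Take m + 1 conjugate gradient steps from x, restarting the
    direction sequence with the negative gradient at x."""
    g = [qi - ri for qi, ri in zip(matvec(Q, x), rhs)]
    d = [-gi for gi in g]
    for _ in range(m + 1):
        gg = dot(g, g)
        if gg == 0.0:          # already at the solution
            break
        Qd = matvec(Q, d)
        alpha = gg / dot(d, Qd)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g = [gi + alpha * qi for gi, qi in zip(g, Qd)]
        beta = dot(g, g) / gg
        d = [-gi + beta * di for gi, di in zip(g, d)]
    return x


def error(Q, x, x_star):
    """E(x) = (1/2)(x - x*)^T Q (x - x*), as in (9.16)."""
    e = [xi - si for xi, si in zip(x, x_star)]
    return 0.5 * dot(e, matvec(Q, e))


small = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]   # n - m eigenvalues in [a, b] = [1, 2]
large = [50.0, 400.0]                     # m = 2 large eigenvalues
eigs = small + large
n = len(eigs)
Q = [[eigs[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
rhs = [1.0] * n
x_star = [r / lam for r, lam in zip(rhs, eigs)]
bound = ((2.0 - 1.0) / (2.0 + 1.0)) ** 2  # ((b - a)/(b + a))^2 = 1/9

x = [0.0] * n
ratios = []
for _ in range(4):
    e_old = error(Q, x, x_star)
    x = partial_cg_cycle(Q, rhs, x, m=len(large))
    ratios.append(error(Q, x, x_star) / e_old)

print(ratios, bound)  # each ratio should not exceed the bound
```

The condition number of this Q is 400, so the steepest descent bound would be close to one, yet each cycle of three steps reduces E by at least the factor 1/9 determined by the interval [1, 2] alone.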


9.6  Extension  to  Nonquadratic  Problems 

The general unconstrained minimization problem on $E^n$

$$\text{minimize} \quad f(x)$$

can be attacked by making suitable approximations to the conjugate gradient algorithm.
There are a number of ways that this might be accomplished; the choice
depends partially on what properties of f are easily computable. We look at three
methods in this section and another in the following section.


Quadratic  Approximation 

In the quadratic approximation method we make the following associations at $x_k$:

$$g_k \leftrightarrow \nabla f(x_k)^T, \qquad Q \leftrightarrow F(x_k),$$

and using these associations, reevaluated at each step, all quantities necessary to
implement the basic conjugate gradient algorithm can be evaluated. If f is quadratic,
these associations are identities, so that the general algorithm obtained by using
them is a generalization of the conjugate gradient scheme. This is similar to the
philosophy underlying Newton's method where at each step the solution of a general
problem is approximated by the solution of a purely quadratic problem through these
same associations.

When applied to nonquadratic problems, conjugate gradient methods will not
usually terminate within n steps. It is possible therefore simply to continue finding
new directions according to the algorithm and terminate only when some termination
criterion is met. Alternatively, the conjugate gradient process can be interrupted
after n or n + 1 steps and restarted with a pure gradient step. Since Q-conjugacy of
the direction vectors in the pure conjugate gradient algorithm is dependent on the
initial direction being the negative gradient, the restarting procedure seems to be
preferred. We always include this restarting procedure. The general conjugate gradient
algorithm is then defined as below.

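Step 1. Starting at $\mathbf{x}_0$ compute $\mathbf{g}_0 = \nabla f(\mathbf{x}_0)^T$ and set $\mathbf{d}_0 = -\mathbf{g}_0$.

Step 2. For $k = 0, 1, \ldots, n-1$:

(a) Set $\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{d}_k$ where

$$\alpha_k = \frac{-\mathbf{g}_k^T \mathbf{d}_k}{\mathbf{d}_k^T \mathbf{F}(\mathbf{x}_k) \mathbf{d}_k}.$$

(b) Compute $\mathbf{g}_{k+1} = \nabla f(\mathbf{x}_{k+1})^T$.

(c) Unless $k = n - 1$, set $\mathbf{d}_{k+1} = -\mathbf{g}_{k+1} + \beta_k \mathbf{d}_k$ where

$$\beta_k = \frac{\mathbf{g}_{k+1}^T \mathbf{F}(\mathbf{x}_k) \mathbf{d}_k}{\mathbf{d}_k^T \mathbf{F}(\mathbf{x}_k) \mathbf{d}_k}$$

and repeat (a).

Step 3. Replace $\mathbf{x}_0$ by $\mathbf{x}_n$ and go back to Step 1.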

An attractive feature of the algorithm is that, just as in the pure form of Newton's method, no line searching is required at any stage. Also, the algorithm converges in a finite number of steps for a quadratic problem. The undesirable features are that $\mathbf{F}(\mathbf{x}_k)$ must be evaluated at each point, which is often impractical, and that the algorithm is not, in this form, globally convergent.
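The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the function name `cg_quadratic_approximation`, the helpers `dot` and `matvec`, the `cycles` parameter, and the test problem are all assumptions introduced here. On a quadratic the associations are identities, so one cycle of $n$ steps should reach the minimum.

```python
def dot(u, v):
    """Inner product of two vectors stored as lists."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in A]


def cg_quadratic_approximation(grad, hess, x0, cycles=1):
    """Conjugate gradient with alpha_k and beta_k taken from the Hessian.

    grad(x) returns the gradient as a list and hess(x) the Hessian as a
    list of rows. Each cycle performs n steps and then restarts with a
    pure gradient step.
    """
    n = len(x0)
    x = list(x0)
    for _ in range(cycles):
        g = grad(x)                          # Step 1
        d = [-gi for gi in g]
        for k in range(n):                   # Step 2
            Fd = matvec(hess(x), d)          # F(x_k) d_k
            dFd = dot(d, Fd)
            if dFd == 0.0:                   # d_k = 0: the gradient has vanished
                return x
            alpha = -dot(g, d) / dFd         # (a): no line search is used
            x = [xi + alpha * di for xi, di in zip(x, d)]
            g = grad(x)                      # (b)
            if k < n - 1:                    # (c)
                beta = dot(g, Fd) / dFd
                d = [-gi + beta * di for gi, di in zip(g, d)]
    return x                                 # Step 3 is the outer loop


# Illustrative quadratic f(x) = (1/2) x^T Q x - b^T x (values chosen here).
Q = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
grad = lambda x: [qi - bi for qi, bi in zip(matvec(Q, x), b)]
hess = lambda x: Q

x_min = cg_quadratic_approximation(grad, hess, [2.0, 1.0])
```

For this example the minimizer is $(1/11, 7/11)$, and a single cycle of two steps reaches it up to rounding. For a nonquadratic $f$ several cycles would be needed, and nothing in the sketch guards against a Hessian that is not positive definite.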


Line  Search  Methods 

It is possible to avoid the direct use of the association $\mathbf{Q} \leftrightarrow \mathbf{F}(\mathbf{x}_k)$. First, instead of using the formula for $\alpha_k$ in Step 2(a) above, $\alpha_k$ is found by a line search that minimizes the objective. This agrees with the formula in the quadratic case. Second, the formula for $\beta_k$ in Step 2(c) is replaced by a different formula, which is, however, equivalent to the one in 2(c) in the quadratic case.

The first such method proposed was the Fletcher-Reeves method, in which Part (e) of the Conjugate Gradient Theorem is employed; that is,

$$\beta_k = \frac{\mathbf{g}_{k+1}^T \mathbf{g}_{k+1}}{\mathbf{g}_k^T \mathbf{g}_k}.$$

The  complete  algorithm  (using  restarts)  is: 

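Step 1. Given $\mathbf{x}_0$ compute $\mathbf{g}_0 = \nabla f(\mathbf{x}_0)^T$ and set $\mathbf{d}_0 = -\mathbf{g}_0$.

Step 2. For $k = 0, 1, \ldots, n - 1$:

(a) Set $\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{d}_k$ where $\alpha_k$ minimizes $f(\mathbf{x}_k + \alpha \mathbf{d}_k)$.

(b) Compute $\mathbf{g}_{k+1} = \nabla f(\mathbf{x}_{k+1})^T$.

(c) Unless $k = n - 1$, set $\mathbf{d}_{k+1} = -\mathbf{g}_{k+1} + \beta_k \mathbf{d}_k$ where

$$\beta_k = \frac{\mathbf{g}_{k+1}^T \mathbf{g}_{k+1}}{\mathbf{g}_k^T \mathbf{g}_k}.$$

Step 3. Replace $\mathbf{x}_0$ by $\mathbf{x}_n$ and go back to Step 1.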

Another important method of this type is the Polak-Ribiere method, where

$$\beta_k = \frac{(\mathbf{g}_{k+1} - \mathbf{g}_k)^T \mathbf{g}_{k+1}}{\mathbf{g}_k^T \mathbf{g}_k}$$

is used to determine $\beta_k$. Again this leads to a value identical to the standard formula in the quadratic case, since successive gradients are then orthogonal. Experimental evidence seems to favor the Polak-Ribiere method over other methods of this general type.
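The agreement of the three formulas for $\beta_k$ in the quadratic case can be checked numerically. The sketch below is only illustrative: the function names `beta_fletcher_reeves` and `beta_polak_ribiere`, the helpers, and the test quadratic are assumptions made here. It takes one exact steepest descent step on a quadratic and then evaluates the Hessian-based formula of Step 2(c) alongside the two line search formulas.

```python
def dot(u, v):
    """Inner product of two vectors stored as lists."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in A]


def beta_fletcher_reeves(g_new, g_old):
    """beta_k = g_{k+1}^T g_{k+1} / g_k^T g_k."""
    return dot(g_new, g_new) / dot(g_old, g_old)


def beta_polak_ribiere(g_new, g_old):
    """beta_k = (g_{k+1} - g_k)^T g_{k+1} / g_k^T g_k."""
    diff = [a - b for a, b in zip(g_new, g_old)]
    return dot(diff, g_new) / dot(g_old, g_old)


# Illustrative quadratic f(x) = (1/2) x^T Q x - b^T x (values chosen here).
Q = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x0 = [2.0, 1.0]

g0 = [qi - bi for qi, bi in zip(matvec(Q, x0), b)]
d0 = [-gi for gi in g0]
Qd0 = matvec(Q, d0)
alpha0 = -dot(g0, d0) / dot(d0, Qd0)          # exact minimizer along d0
x1 = [xi + alpha0 * di for xi, di in zip(x0, d0)]
g1 = [qi - bi for qi, bi in zip(matvec(Q, x1), b)]

beta_hessian = dot(g1, Qd0) / dot(d0, Qd0)    # formula of Step 2(c)
beta_fr = beta_fletcher_reeves(g1, g0)
beta_pr = beta_polak_ribiere(g1, g0)
```

All three values coincide here (up to rounding) because the exact line search makes $\mathbf{g}_1$ orthogonal to $\mathbf{g}_0$. When successive gradients are not orthogonal, as happens for nonquadratic objectives or inexact searches, the Fletcher-Reeves and Polak-Ribiere values differ.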


Convergence 

Global convergence of the line search methods is established by noting that a pure steepest descent step is taken every $n$ steps and serves as a spacer step. Since the other steps do not increase the objective, and in fact hopefully they decrease it, global convergence is assured. Thus the restarting aspect of the algorithm is important for global convergence analysis, since in general one cannot guarantee that the directions $\mathbf{d}_k$ generated by the method are descent directions.

The local convergence properties of both of the above, and most other, nonquadratic extensions of the conjugate gradient method can be inferred from the quadratic analysis. Assuming that at the solution, $\mathbf{x}^*$, the matrix $\mathbf{F}(\mathbf{x}^*)$ is positive definite, we expect the asymptotic convergence rate per step to be at least as good as steepest descent, since this is true in the quadratic case. In addition to this bound on the single step rate we expect that the method is of order two with respect to each complete cycle of $n$ steps. In other words, since one complete cycle solves a quadratic problem exactly just as Newton's method does in one step, we expect that for general nonquadratic problems there will hold $|\mathbf{x}_{k+n} - \mathbf{x}^*| \leq c\,|\mathbf{x}_k - \mathbf{x}^*|^2$ for some $c$ and $k = 0, n, 2n, 3n, \ldots$. This can indeed be proved, and of course underlies the original motivation for the method. For problems with large $n$, however, a result of this type is in itself of little comfort, since we probably hope to terminate in fewer than $n$ steps. Further discussion on this general topic is contained in Sect. 10.4.


Scaling  and  Partial  Methods 

Convergence  of  the  partial  conjugate  gradient  method,  restarted  every  m  +  1  steps, 
will  in  general  be  linear.  The  rate  will  be  determined  by  the  eigenvalue  structure 
of  the  Hessian  matrix  F(x*),  and  it  may  be  possible  to  obtain  fast  convergence 
by  changing  the  eigenvalue  structure  through  scaling  procedures.  If,  for  example, 
the  eigenvalues  can  be  arranged  to  occur  in  m  +  1  bunches,  the  rate  of  the  partial 
method  will  be  relatively  fast.  Other  structures  can  be  analyzed  by  use  of  Theorem  2, 
Sect.  9.4,  by  using  F(x*)  rather  than  Q. 


*9.7 Parallel Tangents

In early experiments with the method of steepest descent the path of descent was noticed to be highly zig-zag in character, making slow indirect progress toward the solution. (This phenomenon is now quite well understood and is predicted by the convergence analysis of Sect. 8.2.) It was also noticed that in two dimensions the solution point often lies close to the line that connects the zig-zag points, as illustrated in Fig. 9.5. This observation motivated the accelerated gradient method in which a complete cycle consists of taking two steepest descent steps and then searching along the line connecting the initial point and the point obtained after the two gradient steps. The method of parallel tangents (PARTAN) was developed through an attempt to extend this idea to an acceleration scheme involving all previous steps. The original development was based largely on a special geometric property of the tangents to the contours of a quadratic function, but the method is now recognized as a particular implementation of the method of conjugate gradients, and this is the context in which it is treated here.

Fig. 9.5 Path of gradient method

The algorithm is defined by reference to Fig. 9.6. Starting at an arbitrary point $\mathbf{x}_0$ the point $\mathbf{x}_1$ is found by a standard steepest descent step. After that, from a point $\mathbf{x}_k$ the corresponding $\mathbf{y}_k$ is first found by a standard steepest descent step from $\mathbf{x}_k$, and then $\mathbf{x}_{k+1}$ is taken to be the minimum point on the line connecting $\mathbf{x}_{k-1}$ and $\mathbf{y}_k$. The process is continued for $n$ steps and then restarted with a standard steepest descent step.

Notice that except for the first step, $\mathbf{x}_{k+1}$ is determined from $\mathbf{x}_k$, not by searching along a single line, but by searching along two lines. The direction $\mathbf{d}_k$ connecting two successive points (indicated as dotted lines in the figure) is thus determined only indirectly. We shall see, however, that, in the case where the objective function is quadratic, the $\mathbf{d}_k$'s are the same directions, and the $\mathbf{x}_k$'s are the same points, as would be generated by the method of conjugate gradients.

PARTAN Theorem. For a quadratic function, PARTAN is equivalent to the method of conjugate gradients.


Fig. 9.6 PARTAN


Fig. 9.7 One step of PARTAN

Proof. The proof is by induction. It is certainly true of the first step, since it is a steepest descent step. Suppose that $\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_k$ have been generated by the conjugate gradient method and $\mathbf{x}_{k+1}$ is determined according to PARTAN. This single step is shown in Fig. 9.7. We want to show that $\mathbf{x}_{k+1}$ is the same point as would be generated by another step of the conjugate gradient method. For this to be true $\mathbf{x}_{k+1}$ must be that point which minimizes $f$ over the plane defined by $\mathbf{d}_{k-1}$ and $\mathbf{g}_k = \nabla f(\mathbf{x}_k)^T$. From the theory of conjugate gradients, this point will also minimize $f$ over the subspace determined by $\mathbf{g}_k$ and all previous $\mathbf{d}_i$'s. Equivalently, we must find the point $\mathbf{x}$ where $\nabla f(\mathbf{x})$ is orthogonal to both $\mathbf{g}_k$ and $\mathbf{d}_{k-1}$. Since $\mathbf{y}_k$ minimizes $f$ along $\mathbf{g}_k$, we see that $\nabla f(\mathbf{y}_k)$ is orthogonal to $\mathbf{g}_k$. Since $\nabla f(\mathbf{x}_{k-1})$ is contained in the subspace $[\mathbf{d}_0, \mathbf{d}_1, \ldots, \mathbf{d}_{k-1}]$ and because $\mathbf{g}_k$ is orthogonal to this subspace by the Expanding Subspace Theorem, we see that $\nabla f(\mathbf{x}_{k-1})$ is also orthogonal to $\mathbf{g}_k$. Since $\nabla f(\mathbf{x})$ is linear in $\mathbf{x}$, it follows that at every point $\mathbf{x}$ on the line through $\mathbf{x}_{k-1}$ and $\mathbf{y}_k$ we have $\nabla f(\mathbf{x})$ orthogonal to $\mathbf{g}_k$. By minimizing $f$ along this line, a point $\mathbf{x}_{k+1}$ is obtained where in addition $\nabla f(\mathbf{x}_{k+1})$ is orthogonal to the line. Thus $\nabla f(\mathbf{x}_{k+1})$ is orthogonal to both $\mathbf{g}_k$ and the line joining $\mathbf{x}_{k-1}$ and $\mathbf{y}_k$. It follows that $\nabla f(\mathbf{x}_{k+1})$ is orthogonal to the plane. ∎

There are advantages and disadvantages of PARTAN relative to other methods when applied to nonquadratic problems. One attractive feature of the algorithm is its simplicity and ease of implementation. Probably its most desirable property, however, is its strong global convergence characteristics. Each step of the process is at least as good as steepest descent, since going from $\mathbf{x}_k$ to $\mathbf{y}_k$ is exactly steepest descent, and the additional move to $\mathbf{x}_{k+1}$ provides further decrease of the objective function. Thus global convergence is not tied to the fact that the process is restarted every $n$ steps. It is suggested, however, that PARTAN should be restarted every $n$ steps (or $n + 1$ steps) so that it will behave like the conjugate gradient method near the solution.

An  undesirable  feature  of  the  algorithm  is  that  two  line  searches  are  required  at 
each  step,  except  the  first,  rather  than  one  as  is  required  by,  say,  the  Fletcher-Reeves 
method.  This  is  at  least  partially  compensated  by  the  fact  that  searches  need  not 
be  as  accurate  for  PARTAN,  for  while  inaccurate  searches  in  the  Fletcher-Reeves 
method  may  yield  nonsensical  successive  search  directions,  PARTAN  will  at  least 
do  as  well  as  steepest  descent. 
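A minimal sketch of one PARTAN cycle on a quadratic may help fix the geometry. It is written under stated assumptions: the objective is $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} - \mathbf{b}^T\mathbf{x}$, so each line search can be done in closed form, and the names `partan_quadratic` and `line_min` as well as the test data are introduced here only for illustration. By the PARTAN Theorem the cycle of $n$ steps should end at the minimum.

```python
def dot(u, v):
    """Inner product of two vectors stored as lists."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in A]


def partan_quadratic(Q, b, x0):
    """One PARTAN cycle (n steps) for f(x) = (1/2) x^T Q x - b^T x."""
    n = len(x0)

    def grad(x):
        return [qi - bi for qi, bi in zip(matvec(Q, x), b)]

    def line_min(x, d):
        """Exact minimizer of f on the line through x with direction d."""
        dQd = dot(d, matvec(Q, d))
        if dQd == 0.0:                       # zero direction: stay put
            return list(x)
        alpha = -dot(grad(x), d) / dQd
        return [xi + alpha * di for xi, di in zip(x, d)]

    x_prev = list(x0)
    x = line_min(x_prev, [-gi for gi in grad(x_prev)])   # x_1: steepest descent
    for _ in range(1, n):
        y = line_min(x, [-gi for gi in grad(x)])         # y_k: steepest descent
        d = [yi - pi for yi, pi in zip(y, x_prev)]       # line through x_{k-1}, y_k
        x_prev, x = x, line_min(y, d)                    # x_{k+1}
    return x


# Illustrative positive definite example (values chosen here).
Q = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x_min = partan_quadratic(Q, b, [0.0, 0.0, 0.0])
```

For this example the solution of $\mathbf{Q}\mathbf{x} = \mathbf{b}$ is $(2/9, 1/9, 13/9)$, which the three PARTAN steps reach up to rounding. Note that every step after the first uses two line searches, which is the extra cost discussed above.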

9.8  Exercises 

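1. Let $\mathbf{Q}$ be a positive definite symmetric matrix and suppose $\mathbf{p}_0, \mathbf{p}_1, \ldots, \mathbf{p}_{n-1}$ are linearly independent vectors in $E^n$. Show that a Gram-Schmidt procedure can be used to generate a sequence of Q-conjugate directions from the $\mathbf{p}_i$'s. Specifically, show that $\mathbf{d}_0, \mathbf{d}_1, \ldots, \mathbf{d}_{n-1}$ defined recursively by

$$\mathbf{d}_0 = \mathbf{p}_0, \qquad \mathbf{d}_{k+1} = \mathbf{p}_{k+1} - \sum_{i=0}^{k} \frac{\mathbf{p}_{k+1}^T \mathbf{Q} \mathbf{d}_i}{\mathbf{d}_i^T \mathbf{Q} \mathbf{d}_i}\, \mathbf{d}_i$$

form a Q-conjugate set.

2. Suppose the $\mathbf{p}_i$'s in Exercise 1 are generated as moments of $\mathbf{Q}$, that is, suppose $\mathbf{p}_k = \mathbf{Q}^k \mathbf{p}_0$, $k = 1, 2, \ldots, n-1$. Show that the corresponding $\mathbf{d}_k$'s can then be generated by a (three-term) recursion formula where $\mathbf{d}_{k+1}$ is defined only in terms of $\mathbf{Q}\mathbf{d}_k$, $\mathbf{d}_k$ and $\mathbf{d}_{k-1}$.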

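3. Suppose the $\mathbf{p}_k$'s in Exercise 1 are taken as $\mathbf{p}_k = \mathbf{e}_k$ where $\mathbf{e}_k$ is the $k$th unit coordinate vector and the $\mathbf{d}_k$'s are constructed accordingly. Show that using the $\mathbf{d}_k$'s in a conjugate direction method to minimize $(1/2)\mathbf{x}^T\mathbf{Q}\mathbf{x} - \mathbf{b}^T\mathbf{x}$ is equivalent to the application of Gaussian elimination to solve $\mathbf{Q}\mathbf{x} = \mathbf{b}$.

4. Let $f(\mathbf{x}) = (1/2)\mathbf{x}^T\mathbf{Q}\mathbf{x} - \mathbf{b}^T\mathbf{x}$ be defined on $E^n$ with $\mathbf{Q}$ positive definite. Let $\mathbf{x}_1$ be a minimum point of $f$ over a subspace of $E^n$ containing the vector $\mathbf{d}$ and let $\mathbf{x}_2$ be the minimum of $f$ over another subspace containing $\mathbf{d}$. Suppose $f(\mathbf{x}_1) < f(\mathbf{x}_2)$. Show that $\mathbf{x}_1 - \mathbf{x}_2$ is Q-conjugate to $\mathbf{d}$.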

5. Let $\mathbf{Q}$ be a symmetric matrix. Show that any two eigenvectors of $\mathbf{Q}$, corresponding to distinct eigenvalues, are Q-conjugate.

6. Let $\mathbf{Q}$ be an $n \times n$ symmetric matrix and let $\mathbf{d}_0, \mathbf{d}_1, \ldots, \mathbf{d}_{n-1}$ be Q-conjugate. Show how to find an $\mathbf{E}$ such that $\mathbf{E}^T\mathbf{Q}\mathbf{E}$ is diagonal.

7. Show that in the conjugate gradient method $\mathbf{Q}\mathbf{d}_{k-1} \in \mathcal{B}_{k+1}$.

8.  Derive  the  rate  of  convergence  of  the  method  of  steepest  descent  by  viewing  it 
as  a  one-step  optimal  process. 

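9. Let $P_k(\mathbf{Q}) = c_0 + c_1\mathbf{Q} + c_2\mathbf{Q}^2 + \cdots + c_m\mathbf{Q}^m$ be the optimal polynomial in (9.29) minimizing (9.30). Show that the $c_i$'s can be found explicitly by solving the vector equation

$$\begin{bmatrix} \mathbf{g}_k^T\mathbf{Q}\mathbf{g}_k & \mathbf{g}_k^T\mathbf{Q}^2\mathbf{g}_k & \cdots & \mathbf{g}_k^T\mathbf{Q}^{m+1}\mathbf{g}_k \\ \mathbf{g}_k^T\mathbf{Q}^2\mathbf{g}_k & \mathbf{g}_k^T\mathbf{Q}^3\mathbf{g}_k & \cdots & \mathbf{g}_k^T\mathbf{Q}^{m+2}\mathbf{g}_k \\ \vdots & \vdots & & \vdots \\ \mathbf{g}_k^T\mathbf{Q}^{m+1}\mathbf{g}_k & \mathbf{g}_k^T\mathbf{Q}^{m+2}\mathbf{g}_k & \cdots & \mathbf{g}_k^T\mathbf{Q}^{2m+1}\mathbf{g}_k \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_m \end{bmatrix} = -\begin{bmatrix} \mathbf{g}_k^T\mathbf{g}_k \\ \mathbf{g}_k^T\mathbf{Q}\mathbf{g}_k \\ \vdots \\ \mathbf{g}_k^T\mathbf{Q}^m\mathbf{g}_k \end{bmatrix}.$$

Show that this reduces to steepest descent when $m = 0$.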

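10. Show that for the method of conjugate directions there holds

$$E(\mathbf{x}_k) \leq 4\left(\frac{1 - \sqrt{\gamma}}{1 + \sqrt{\gamma}}\right)^{2k} E(\mathbf{x}_0),$$

where $\gamma = a/A$ and $a$ and $A$ are the smallest and largest eigenvalues of $\mathbf{Q}$. Hint: In (9.27) select $P_{k-1}(\lambda)$ so that

$$1 + \lambda P_{k-1}(\lambda) = \frac{T_k\!\left(\dfrac{A + a - 2\lambda}{A - a}\right)}{T_k\!\left(\dfrac{A + a}{A - a}\right)},$$

where $T_k(\lambda) = \cos(k \arccos \lambda)$ is the $k$th Chebyshev polynomial. This choice gives the minimum maximum magnitude on $[a, A]$. Verify and use the inequality

$$\frac{(1 - \gamma)^k}{(1 + \sqrt{\gamma})^{2k} + (1 - \sqrt{\gamma})^{2k}} \leq \left(\frac{1 - \sqrt{\gamma}}{1 + \sqrt{\gamma}}\right)^k.$$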

11. Suppose it is known that each eigenvalue of $\mathbf{Q}$ lies either in the interval $[a, A]$ or in the interval $[a + \Delta, A + \Delta]$ where $a$, $A$, and $\Delta$ are all positive. Show that the partial conjugate gradient method restarted every two steps will converge with a ratio no greater than $[(A - a)/(A + a)]^2$ no matter how large $\Delta$ is.

12.  Modify  the  first  method  given  in  Sect.  9.6  so  that  it  is  globally  convergent. 

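13. Show that in the purely quadratic form of the conjugate gradient method $\mathbf{d}_k^T\mathbf{Q}\mathbf{d}_k = -\mathbf{d}_k^T\mathbf{Q}\mathbf{g}_k$. Using this show that to obtain $\mathbf{x}_{k+1}$ from $\mathbf{x}_k$ it is necessary to use $\mathbf{Q}$ only to evaluate $\mathbf{g}_k$ and $\mathbf{Q}\mathbf{g}_k$.

14. Show that in the quadratic problem $\mathbf{Q}\mathbf{g}_k$ can be evaluated by taking a unit step from $\mathbf{x}_k$ in the direction of the negative gradient and evaluating the gradient there. Specifically, if $\mathbf{y}_k = \mathbf{x}_k - \mathbf{g}_k$ and $\mathbf{p}_k = \nabla f(\mathbf{y}_k)^T$, then $\mathbf{Q}\mathbf{g}_k = \mathbf{g}_k - \mathbf{p}_k$.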

15. Combine the results of Exercises 13 and 14 to derive a conjugate gradient method for general problems much in the spirit of the first method of Sect. 9.6 but which does not require knowledge of $\mathbf{F}(\mathbf{x}_k)$ or a line search.


Chapter 10

Quasi-Newton Methods


In this chapter we take another approach toward the development of methods lying somewhere intermediate to steepest descent and Newton's method. Again working under the assumption that evaluation and use of the Hessian matrix is impractical or costly, the idea underlying quasi-Newton methods is to use an approximation to the inverse Hessian in place of the true inverse that is required in Newton's method. The form of the approximation varies among different methods, ranging from the simplest where it remains fixed throughout the iterative process, to the more advanced where improved approximations are built up on the basis of information gathered during the descent process.

The quasi-Newton methods that build up an approximation to the inverse Hessian are analytically the most sophisticated methods discussed in this book for solving unconstrained problems and represent the culmination of the development of algorithms through detailed analysis of the quadratic problem. As might be expected, the convergence properties of these methods are somewhat more difficult to discover than those of simpler methods. Nevertheless, we are able, by continuing with the same basic techniques as before, to illuminate their most important features.

In the course of our analysis we develop two important generalizations of the method of steepest descent and its corresponding convergence rate theorem. The first, discussed in Sect. 10.1, modifies steepest descent by taking as the direction vector a positive definite transformation of the negative gradient. The second, discussed in Sect. 10.8, is a combination of steepest descent and Newton's method. Both of these fundamental methods have convergence properties analogous to those of steepest descent.


10.1 Modified Newton Method

A very basic iterative process for solving the problem

$$\text{minimize } f(\mathbf{x})$$

which includes as special cases most of our earlier ones is

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \mathbf{S}_k \nabla f(\mathbf{x}_k)^T \tag{10.1}$$

where $\mathbf{S}_k$ is a symmetric $n \times n$ matrix and where, as usual, $\alpha_k$ is chosen to minimize $f(\mathbf{x}_{k+1})$. If $\mathbf{S}_k$ is the inverse of the Hessian of $f$, we obtain Newton's method, while if $\mathbf{S}_k = \mathbf{I}$ we have steepest descent. It would seem to be a good idea, in general, to select $\mathbf{S}_k$ as an approximation to the inverse of the Hessian. We examine that philosophy in this section.

First, we note, as in Sect. 8.5, that in order that the process (10.1) be guaranteed to be a descent method for small values of $\alpha$, it is necessary in general to require that $\mathbf{S}_k$ be positive definite. We shall therefore always impose this as a requirement.

Because of the similarity of the algorithm (10.1) with steepest descent$^1$ it should not be surprising that its convergence properties are similar in character to our earlier results. We derive the actual rate of convergence by considering, as usual, the standard quadratic problem with

$$f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} - \mathbf{b}^T\mathbf{x}, \tag{10.2}$$

where $\mathbf{Q}$ is symmetric and positive definite. For this case we can find an explicit expression for $\alpha_k$ in (10.1). The algorithm becomes

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \mathbf{S}_k \mathbf{g}_k, \tag{10.3a}$$

where

$$\mathbf{g}_k = \mathbf{Q}\mathbf{x}_k - \mathbf{b} \tag{10.3b}$$

$$\alpha_k = \frac{\mathbf{g}_k^T \mathbf{S}_k \mathbf{g}_k}{\mathbf{g}_k^T \mathbf{S}_k \mathbf{Q} \mathbf{S}_k \mathbf{g}_k}. \tag{10.3c}$$

We may then derive the convergence rate of this algorithm by slightly extending the analysis carried out for the method of steepest descent.

Modified Newton Method Theorem (Quadratic Case). Let $\mathbf{x}^*$ be the unique minimum point of $f$, and define $E(\mathbf{x}) = \tfrac{1}{2}(\mathbf{x} - \mathbf{x}^*)^T\mathbf{Q}(\mathbf{x} - \mathbf{x}^*)$. Then for the algorithm (10.3) there holds at every step $k$

$$E(\mathbf{x}_{k+1}) \leq \left(\frac{B_k - b_k}{B_k + b_k}\right)^2 E(\mathbf{x}_k), \tag{10.4}$$

where $b_k$ and $B_k$ are, respectively, the smallest and largest eigenvalues of the matrix $\mathbf{S}_k\mathbf{Q}$.

$^1$ The algorithm (10.1) is sometimes referred to as the method of deflected gradients, since the direction vector can be thought of as being determined by deflecting the gradient through multiplication by $\mathbf{S}_k$.

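Proof. We have by direct substitution

$$\frac{E(\mathbf{x}_k) - E(\mathbf{x}_{k+1})}{E(\mathbf{x}_k)} = \frac{(\mathbf{g}_k^T \mathbf{S}_k \mathbf{g}_k)^2}{(\mathbf{g}_k^T \mathbf{S}_k \mathbf{Q} \mathbf{S}_k \mathbf{g}_k)(\mathbf{g}_k^T \mathbf{Q}^{-1} \mathbf{g}_k)}.$$

Letting $\mathbf{T}_k = \mathbf{S}_k^{1/2}\mathbf{Q}\mathbf{S}_k^{1/2}$ and $\mathbf{p}_k = \mathbf{S}_k^{1/2}\mathbf{g}_k$ we obtain

$$\frac{E(\mathbf{x}_k) - E(\mathbf{x}_{k+1})}{E(\mathbf{x}_k)} = \frac{(\mathbf{p}_k^T \mathbf{p}_k)^2}{(\mathbf{p}_k^T \mathbf{T}_k \mathbf{p}_k)(\mathbf{p}_k^T \mathbf{T}_k^{-1} \mathbf{p}_k)}.$$

From the Kantorovich inequality we obtain easily

$$E(\mathbf{x}_{k+1}) \leq \left(\frac{B_k - b_k}{B_k + b_k}\right)^2 E(\mathbf{x}_k),$$

where $b_k$ and $B_k$ are the smallest and largest eigenvalues of $\mathbf{T}_k$. Since $\mathbf{S}_k^{1/2}\mathbf{T}_k\mathbf{S}_k^{-1/2} = \mathbf{S}_k\mathbf{Q}$, we see that $\mathbf{S}_k\mathbf{Q}$ is similar to $\mathbf{T}_k$ and therefore has the same eigenvalues. ∎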
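The bound (10.4) can be checked numerically on a small problem. The following sketch is illustrative only: the matrices $\mathbf{Q}$ and $\mathbf{S}$ (held fixed over the iterations), the starting point, and the helper names `modified_newton_step`, `error`, and `eigenvalues_2x2` are assumptions chosen here, and the eigenvalue routine is valid only for $2 \times 2$ matrices with real eigenvalues.

```python
import math


def dot(u, v):
    """Inner product of two vectors stored as lists."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in A]


def modified_newton_step(Q, b, S, x):
    """One step of (10.3) for f(x) = (1/2) x^T Q x - b^T x."""
    g = [qi - bi for qi, bi in zip(matvec(Q, x), b)]      # (10.3b)
    Sg = matvec(S, g)
    alpha = dot(g, Sg) / dot(Sg, matvec(Q, Sg))           # (10.3c)
    return [xi - alpha * si for xi, si in zip(x, Sg)]     # (10.3a)


def error(Q, x, x_star):
    """E(x) = (1/2)(x - x*)^T Q (x - x*)."""
    e = [xi - si for xi, si in zip(x, x_star)]
    return 0.5 * dot(e, matvec(Q, e))


def eigenvalues_2x2(M):
    """(smallest, largest) real eigenvalues of a 2x2 matrix."""
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    root = math.sqrt(tr * tr - 4.0 * det)
    return (tr - root) / 2.0, (tr + root) / 2.0


# Illustrative data (chosen here): S is a crude diagonal guess at Q^{-1}.
Q = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_star = [1.0 / 11.0, 7.0 / 11.0]             # solves Q x = b
S = [[0.25, 0.0], [0.0, 0.3]]

SQ = [[sum(S[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
      for i in range(2)]
b_k, B_k = eigenvalues_2x2(SQ)
bound = ((B_k - b_k) / (B_k + b_k)) ** 2      # right-hand factor in (10.4)

x = [2.0, 1.0]
ratios = []
for _ in range(4):
    x_next = modified_newton_step(Q, b, S, x)
    ratios.append(error(Q, x_next, x_star) / error(Q, x, x_star))
    x = x_next
```

Each entry of `ratios` is the observed value of $E(\mathbf{x}_{k+1})/E(\mathbf{x}_k)$ and should not exceed `bound`. With this choice of $\mathbf{S}$ the eigenvalues of $\mathbf{S}\mathbf{Q}$ are closer together than those of $\mathbf{Q}$, so the bound is somewhat smaller than the corresponding steepest descent bound.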

This theorem supports the intuitive notion that for the quadratic problem one should strive to make $\mathbf{S}_k$ close to $\mathbf{Q}^{-1}$ since then both $b_k$ and $B_k$ would be close to unity and convergence would be rapid. For a nonquadratic objective function $f$ the analog to $\mathbf{Q}$ is the Hessian $\mathbf{F}(\mathbf{x})$, and hence one should try to make $\mathbf{S}_k$ close to $\mathbf{F}(\mathbf{x}_k)^{-1}$.

Two remarks may help to put the above result in proper perspective. The first remark is that both the algorithm (10.1) and the theorem stated above are only simple, minor, and natural extensions of the work presented in Chap. 8 on steepest descent. As such the result of this section can be regarded, correspondingly, not as a new idea but as an extension of the basic result on steepest descent. The second remark is that this one simple result when properly applied can quickly characterize the convergence properties of some fairly complex algorithms. Thus, rather than an isolated result concerned with a specific form of algorithm, the theorem above should be regarded as a general tool for convergence analysis. It provides significant insight into various quasi-Newton methods discussed in this chapter.


A  Classical  Method 

We conclude this section by mentioning the classical modified Newton's method, a standard method for approximating Newton's method without evaluating $\mathbf{F}(\mathbf{x}_k)^{-1}$ for each $k$. We set

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k [\mathbf{F}(\mathbf{x}_0)]^{-1} \nabla f(\mathbf{x}_k)^T. \tag{10.5}$$

In this method the Hessian at the initial point $\mathbf{x}_0$ is used throughout the process. The effectiveness of this procedure is governed largely by how fast the Hessian is changing; in other words, by the magnitude of the third derivatives of $f$.


10.2  Construction  of  the  Inverse 

The fundamental idea behind most quasi-Newton methods is to try to construct the inverse Hessian, or an approximation of it, using information gathered as the descent process progresses. The current approximation $\mathbf{H}_k$ is then used at each stage to define the next descent direction by setting $\mathbf{S}_k = \mathbf{H}_k$ in the modified Newton method. Ideally, the approximations converge to the inverse of the Hessian at the solution point and the overall method behaves somewhat like Newton's method. In this section we show how the inverse Hessian can be built up from gradient information obtained at various points.

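Let $f$ be a function on $E^n$ that has continuous second partial derivatives. If for two points $\mathbf{x}_{k+1}$, $\mathbf{x}_k$ we define $\mathbf{g}_{k+1} = \nabla f(\mathbf{x}_{k+1})^T$, $\mathbf{g}_k = \nabla f(\mathbf{x}_k)^T$ and $\mathbf{p}_k = \mathbf{x}_{k+1} - \mathbf{x}_k$, then

$$\mathbf{g}_{k+1} - \mathbf{g}_k \approx \mathbf{F}(\mathbf{x}_k)\mathbf{p}_k. \tag{10.6}$$

If the Hessian, $\mathbf{F}$, is constant, then we have

$$\mathbf{q}_k \equiv \mathbf{g}_{k+1} - \mathbf{g}_k = \mathbf{F}\mathbf{p}_k, \tag{10.7}$$

and we see that evaluation of the gradient at two points gives information about $\mathbf{F}$. If $n$ linearly independent directions $\mathbf{p}_0, \mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_{n-1}$ and the corresponding $\mathbf{q}_k$'s are known, then $\mathbf{F}$ is uniquely determined. Indeed, letting $\mathbf{P}$ and $\mathbf{Q}$ be the $n \times n$ matrices with columns $\mathbf{p}_k$ and $\mathbf{q}_k$ respectively, we have $\mathbf{F} = \mathbf{Q}\mathbf{P}^{-1}$.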

It is natural to attempt to construct successive approximations $\mathbf{H}_k$ to $\mathbf{F}^{-1}$ based on data obtained from the first $k$ steps of a descent process in such a way that if $\mathbf{F}$ were constant the approximation would be consistent with (10.7) for these steps. Specifically, if $\mathbf{F}$ were constant $\mathbf{H}_{k+1}$ would satisfy

$$\mathbf{H}_{k+1}\mathbf{q}_i = \mathbf{p}_i, \qquad 0 \leq i \leq k. \tag{10.8}$$

After $n$ linearly independent steps we would then have $\mathbf{H}_n = \mathbf{F}^{-1}$.

For any $k < n$ the problem of constructing a suitable $\mathbf{H}_k$, which in general serves as an approximation to the inverse Hessian and which in the case of constant $\mathbf{F}$ satisfies (10.8), admits an infinity of solutions, since there are more degrees of freedom than there are constraints. Thus a particular method can take into account additional considerations. We discuss below one of the simplest schemes that has been proposed.


Rank One Correction


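Since $\mathbf{F}$ and $\mathbf{F}^{-1}$ are symmetric, it is natural to require that $\mathbf{H}_k$, the approximation to $\mathbf{F}^{-1}$, be symmetric. We investigate the possibility of defining a recursion of the form

$$\mathbf{H}_{k+1} = \mathbf{H}_k + a_k \mathbf{z}_k \mathbf{z}_k^T, \tag{10.9}$$

which preserves symmetry. The vector $\mathbf{z}_k$ and the constant $a_k$ define a matrix of (at most) rank one, by which the approximation to the inverse is updated. We select them so that (10.8) is satisfied. Setting $i$ equal to $k$ in (10.8) and substituting (10.9) we obtain

$$\mathbf{p}_k = \mathbf{H}_{k+1}\mathbf{q}_k = \mathbf{H}_k\mathbf{q}_k + a_k \mathbf{z}_k \mathbf{z}_k^T \mathbf{q}_k. \tag{10.10}$$

Taking the inner product with $\mathbf{q}_k$ we have

$$\mathbf{q}_k^T\mathbf{p}_k - \mathbf{q}_k^T\mathbf{H}_k\mathbf{q}_k = a_k (\mathbf{z}_k^T\mathbf{q}_k)^2. \tag{10.11}$$

On the other hand, using (10.10) we may write (10.9) as

$$\mathbf{H}_{k+1} = \mathbf{H}_k + \frac{(\mathbf{p}_k - \mathbf{H}_k\mathbf{q}_k)(\mathbf{p}_k - \mathbf{H}_k\mathbf{q}_k)^T}{a_k (\mathbf{z}_k^T\mathbf{q}_k)^2},$$

which in view of (10.11) leads finally to

$$\mathbf{H}_{k+1} = \mathbf{H}_k + \frac{(\mathbf{p}_k - \mathbf{H}_k\mathbf{q}_k)(\mathbf{p}_k - \mathbf{H}_k\mathbf{q}_k)^T}{\mathbf{q}_k^T(\mathbf{p}_k - \mathbf{H}_k\mathbf{q}_k)}. \tag{10.12}$$

We have determined what a rank one correction must be if it is to satisfy (10.8) for $i = k$. It remains to be shown that, for the case where $\mathbf{F}$ is constant, (10.8) is also satisfied for $i < k$. This in turn will imply that the rank one recursion converges to $\mathbf{F}^{-1}$ after at most $n$ steps.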


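Theorem. Let $\mathbf{F}$ be a fixed symmetric matrix and suppose that $\mathbf{p}_0, \mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_k$ are given vectors. Define the vectors $\mathbf{q}_i = \mathbf{F}\mathbf{p}_i$, $i = 0, 1, 2, \ldots, k$. Starting with any initial symmetric matrix $\mathbf{H}_0$ let

$$\mathbf{H}_{i+1} = \mathbf{H}_i + \frac{(\mathbf{p}_i - \mathbf{H}_i\mathbf{q}_i)(\mathbf{p}_i - \mathbf{H}_i\mathbf{q}_i)^T}{\mathbf{q}_i^T(\mathbf{p}_i - \mathbf{H}_i\mathbf{q}_i)}. \tag{10.13}$$

Then

$$\mathbf{p}_i = \mathbf{H}_{k+1}\mathbf{q}_i \quad \text{for } i \leq k. \tag{10.14}$$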


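Proof. The proof is by induction. Suppose it is true for $\mathbf{H}_k$, and $i \leq k - 1$. The relation was shown above to be true for $\mathbf{H}_{k+1}$ and $i = k$. For $i < k$

$$\mathbf{H}_{k+1}\mathbf{q}_i = \mathbf{H}_k\mathbf{q}_i + \mathbf{y}_k(\mathbf{p}_k^T\mathbf{q}_i - \mathbf{q}_k^T\mathbf{H}_k\mathbf{q}_i), \tag{10.15}$$

where

$$\mathbf{y}_k = \frac{\mathbf{p}_k - \mathbf{H}_k\mathbf{q}_k}{\mathbf{q}_k^T(\mathbf{p}_k - \mathbf{H}_k\mathbf{q}_k)}.$$

By the induction hypothesis, (10.15) becomes

$$\mathbf{H}_{k+1}\mathbf{q}_i = \mathbf{p}_i + \mathbf{y}_k(\mathbf{p}_k^T\mathbf{q}_i - \mathbf{q}_k^T\mathbf{p}_i).$$

From the calculation

$$\mathbf{q}_k^T\mathbf{p}_i = \mathbf{p}_k^T\mathbf{F}\mathbf{p}_i = \mathbf{p}_k^T\mathbf{q}_i,$$

it follows that the second term vanishes. ∎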

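To incorporate the approximate inverse Hessian in a descent procedure while simultaneously improving it, we calculate the direction $\mathbf{d}_k$ from

$$\mathbf{d}_k = -\mathbf{H}_k\mathbf{g}_k$$

and then minimize $f(\mathbf{x}_k + \alpha\mathbf{d}_k)$ with respect to $\alpha \geq 0$. This determines $\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\mathbf{d}_k$, $\mathbf{p}_k = \alpha_k\mathbf{d}_k$, and $\mathbf{g}_{k+1}$. Then $\mathbf{H}_{k+1}$ can be calculated according to (10.12).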

There are some difficulties with this simple rank one procedure. First, the updating formula (10.12) preserves positive definiteness only if $\mathbf{q}_k^T(\mathbf{p}_k - \mathbf{H}_k\mathbf{q}_k) > 0$, which cannot be guaranteed (see Exercise 6). Also, even if $\mathbf{q}_k^T(\mathbf{p}_k - \mathbf{H}_k\mathbf{q}_k)$ is positive, it may be small, which can lead to numerical difficulties. Thus, although an excellent simple example of how information gathered during the descent process can in principle be used to update an approximation to the inverse Hessian, the rank one method possesses some limitations.
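The rank one recursion (10.12) is short enough to try by hand or in code. The sketch below is a minimal illustration under assumptions made here: the matrix $\mathbf{F}$, the steps $\mathbf{p}_0, \mathbf{p}_1$ (simply the coordinate vectors, not steps of a descent process), and the function name `rank_one_update` are all hypothetical, and the tolerance that skips an update when the denominator is tiny is a common safeguard added for the sketch rather than part of (10.12).

```python
def dot(u, v):
    """Inner product of two vectors stored as lists."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in A]


def rank_one_update(H, p, q, tol=1e-12):
    """Return H + (p - Hq)(p - Hq)^T / q^T (p - Hq), as in (10.12).

    The update is skipped when the denominator is tiny; this safeguard
    is an assumption of the sketch, not part of the formula.
    """
    r = [pi - hi for pi, hi in zip(p, matvec(H, q))]    # p - Hq
    denom = dot(q, r)
    if abs(denom) < tol:
        return [row[:] for row in H]
    n = len(p)
    return [[H[i][j] + r[i] * r[j] / denom for j in range(n)]
            for i in range(n)]


# Illustrative constant Hessian and two linearly independent steps.
F = [[4.0, 1.0], [1.0, 3.0]]
steps = [[1.0, 0.0], [0.0, 1.0]]
H = [[1.0, 0.0], [0.0, 1.0]]          # H_0: any symmetric matrix
for p in steps:
    q = matvec(F, p)                  # q_i = F p_i
    H = rank_one_update(H, p, q)
```

After the two updates `H` equals $\mathbf{F}^{-1} = \frac{1}{11}\begin{bmatrix} 3 & -1 \\ -1 & 4 \end{bmatrix}$ up to rounding, in agreement with the theorem. In this example both denominators $\mathbf{q}_i^T(\mathbf{p}_i - \mathbf{H}_i\mathbf{q}_i)$ happen to be negative, which illustrates that the sign condition for preserving positive definiteness is not automatically met.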


10.3  Davidon-Fletcher-Powell  Method 

The earliest, and certainly one of the most clever schemes for constructing the inverse Hessian, was originally proposed by Davidon and later developed by Fletcher and Powell. It has the fascinating and desirable property that, for a quadratic objective, it simultaneously generates the directions of the conjugate gradient method while constructing the inverse Hessian. At each step the inverse Hessian is updated by the sum of two symmetric rank one matrices, and this scheme is therefore often referred to as a rank two correction procedure. The method is also often referred to as the variable metric method, the name originally suggested by Davidon.

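The procedure is this: Starting with any symmetric positive definite matrix $\mathbf{H}_0$, any point $\mathbf{x}_0$, and with $k = 0$,

Step 1. Set $\mathbf{d}_k = -\mathbf{H}_k\mathbf{g}_k$.

Step 2. Minimize $f(\mathbf{x}_k + \alpha\mathbf{d}_k)$ with respect to $\alpha \geq 0$ to obtain $\mathbf{x}_{k+1}$, $\mathbf{p}_k = \alpha_k\mathbf{d}_k$, and $\mathbf{g}_{k+1}$.

Step 3. Set $\mathbf{q}_k = \mathbf{g}_{k+1} - \mathbf{g}_k$ and

$$\mathbf{H}_{k+1} = \mathbf{H}_k + \frac{\mathbf{p}_k\mathbf{p}_k^T}{\mathbf{p}_k^T\mathbf{q}_k} - \frac{\mathbf{H}_k\mathbf{q}_k\mathbf{q}_k^T\mathbf{H}_k}{\mathbf{q}_k^T\mathbf{H}_k\mathbf{q}_k}. \tag{10.16}$$

Update $k$ and return to Step 1.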


Positive  Definiteness 


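We first demonstrate that if $\mathbf{H}_k$ is positive definite, then so is $\mathbf{H}_{k+1}$. For any $\mathbf{x} \in E^n$ we have

$$\mathbf{x}^T\mathbf{H}_{k+1}\mathbf{x} = \mathbf{x}^T\mathbf{H}_k\mathbf{x} + \frac{(\mathbf{x}^T\mathbf{p}_k)^2}{\mathbf{p}_k^T\mathbf{q}_k} - \frac{(\mathbf{x}^T\mathbf{H}_k\mathbf{q}_k)^2}{\mathbf{q}_k^T\mathbf{H}_k\mathbf{q}_k}. \tag{10.17}$$

Defining $\mathbf{a} = \mathbf{H}_k^{1/2}\mathbf{x}$, $\mathbf{b} = \mathbf{H}_k^{1/2}\mathbf{q}_k$ we may rewrite (10.17) as

$$\mathbf{x}^T\mathbf{H}_{k+1}\mathbf{x} = \frac{(\mathbf{a}^T\mathbf{a})(\mathbf{b}^T\mathbf{b}) - (\mathbf{a}^T\mathbf{b})^2}{\mathbf{b}^T\mathbf{b}} + \frac{(\mathbf{x}^T\mathbf{p}_k)^2}{\mathbf{p}_k^T\mathbf{q}_k}.$$

We also have

$$\mathbf{p}_k^T\mathbf{q}_k = \mathbf{p}_k^T\mathbf{g}_{k+1} - \mathbf{p}_k^T\mathbf{g}_k = -\mathbf{p}_k^T\mathbf{g}_k, \tag{10.18}$$

since

$$\mathbf{p}_k^T\mathbf{g}_{k+1} = 0, \tag{10.19}$$

because $\mathbf{x}_{k+1}$ is the minimum point of $f$ along $\mathbf{p}_k$. Thus by definition of $\mathbf{p}_k$

$$\mathbf{p}_k^T\mathbf{q}_k = \alpha_k\mathbf{g}_k^T\mathbf{H}_k\mathbf{g}_k, \tag{10.20}$$

and hence

$$\mathbf{x}^T\mathbf{H}_{k+1}\mathbf{x} = \frac{(\mathbf{a}^T\mathbf{a})(\mathbf{b}^T\mathbf{b}) - (\mathbf{a}^T\mathbf{b})^2}{\mathbf{b}^T\mathbf{b}} + \frac{(\mathbf{x}^T\mathbf{p}_k)^2}{\alpha_k\mathbf{g}_k^T\mathbf{H}_k\mathbf{g}_k}. \tag{10.21}$$

Both terms on the right of (10.21) are nonnegative, the first by the Cauchy-Schwarz inequality. We must only show they do not both vanish simultaneously. The first term vanishes only if $\mathbf{a}$ and $\mathbf{b}$ are proportional. This in turn implies that $\mathbf{x}$ and $\mathbf{q}_k$ are proportional, say $\mathbf{x} = \beta\mathbf{q}_k$. In that case, however,

$$\mathbf{p}_k^T\mathbf{x} = \beta\mathbf{p}_k^T\mathbf{q}_k = \beta\alpha_k\mathbf{g}_k^T\mathbf{H}_k\mathbf{g}_k \neq 0$$

from (10.20). Thus $\mathbf{x}^T\mathbf{H}_{k+1}\mathbf{x} > 0$ for all nonzero $\mathbf{x}$.

It is of interest to note that in the proof above the fact that $\alpha_k$ is chosen as the minimum point of the line search was used in (10.19), which led to the important conclusion $\mathbf{p}_k^T\mathbf{q}_k > 0$. Actually any $\alpha_k$, whether the minimum point or not, that gives $\mathbf{p}_k^T\mathbf{q}_k > 0$ can be used in the algorithm, and $\mathbf{H}_{k+1}$ will be positive definite (see Exercises 8 and 9).


Finite Step Convergence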

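We assume now that $f$ is quadratic with (constant) Hessian $\mathbf{F}$. We show in this case that the Davidon-Fletcher-Powell method produces direction vectors $\mathbf{p}_k$ that are F-orthogonal and that if the method is carried $n$ steps then $\mathbf{H}_n = \mathbf{F}^{-1}$.

Theorem. If $f$ is quadratic with positive definite Hessian $\mathbf{F}$, then for the Davidon-Fletcher-Powell method

$$\mathbf{p}_i^T\mathbf{F}\mathbf{p}_j = 0, \qquad 0 \leq i < j \leq k \tag{10.22}$$

$$\mathbf{H}_{k+1}\mathbf{F}\mathbf{p}_i = \mathbf{p}_i \qquad \text{for } 0 \leq i \leq k. \tag{10.23}$$

Proof. We note that for the quadratic case

$$\mathbf{q}_k = \mathbf{g}_{k+1} - \mathbf{g}_k = \mathbf{F}\mathbf{x}_{k+1} - \mathbf{F}\mathbf{x}_k = \mathbf{F}\mathbf{p}_k. \tag{10.24}$$

Also

$$\mathbf{H}_{k+1}\mathbf{F}\mathbf{p}_k = \mathbf{H}_{k+1}\mathbf{q}_k = \mathbf{p}_k \tag{10.25}$$

from (10.16).

We now prove (10.22) and (10.23) by induction. From (10.25) we see that they are true for $k = 0$. Assuming they are true for $k - 1$, we prove they are true for $k$. We have

$$\mathbf{g}_k = \mathbf{g}_{i+1} + \mathbf{F}(\mathbf{p}_{i+1} + \cdots + \mathbf{p}_{k-1}).$$

Therefore from (10.22) and (10.19)

$$\mathbf{p}_i^T\mathbf{g}_k = \mathbf{p}_i^T\mathbf{g}_{i+1} = 0 \qquad \text{for } 0 \leq i < k. \tag{10.26}$$

Hence from (10.23)

$$\mathbf{p}_i^T\mathbf{F}\mathbf{H}_k\mathbf{g}_k = 0. \tag{10.27}$$

Thus since $\mathbf{p}_k = -\alpha_k\mathbf{H}_k\mathbf{g}_k$ and since $\alpha_k \neq 0$, we obtain

$$\mathbf{p}_i^T\mathbf{F}\mathbf{p}_k = 0 \qquad \text{for } i < k, \tag{10.28}$$

which proves (10.22) for $k$.

Now since from (10.23) for $k - 1$, (10.24) and (10.28)

$$\mathbf{q}_k^T\mathbf{H}_k\mathbf{F}\mathbf{p}_i = \mathbf{q}_k^T\mathbf{p}_i = \mathbf{p}_k^T\mathbf{F}\mathbf{p}_i = 0, \qquad 0 \leq i < k$$

we have

$$\mathbf{H}_{k+1}\mathbf{F}\mathbf{p}_i = \mathbf{H}_k\mathbf{F}\mathbf{p}_i = \mathbf{p}_i, \qquad 0 \leq i < k.$$

This together with (10.25) proves (10.23) for $k$. ∎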

Since the $\mathbf{p}_k$'s are $\mathbf{F}$-orthogonal and since we minimize $f$ successively in these
directions, we see that the method is a conjugate direction method. Furthermore,
if the initial approximation $\mathbf{H}_0$ is taken equal to the identity matrix, the method
becomes the conjugate gradient method. In any case the process obtains the overall
minimum point within $n$ steps.

Finally, (10.23) shows that $\mathbf{p}_0, \mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_k$ are eigenvectors corresponding to
unity eigenvalue for the matrix $\mathbf{H}_{k+1}\mathbf{F}$. These eigenvectors are linearly independent,
since they are $\mathbf{F}$-orthogonal, and therefore $\mathbf{H}_n = \mathbf{F}^{-1}$.
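
The finite-step property can be checked numerically on a small problem. The sketch below is only an illustration under stated assumptions: it uses a hypothetical 3 x 3 positive definite matrix `F` and vector `b` (not taken from the text), plain Python lists, and the exact step length that is available in closed form for a quadratic. The helper names are invented for this example.

```python
# DFP with exact line search on f(x) = 0.5 x^T F x - b^T x (illustrative data).
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dfp_update(H, p, q):
    """H + p p^T/(p^T q) - H q q^T H/(q^T H q), i.e. the DFP formula."""
    Hq = matvec(H, q)
    pq, qHq = dot(p, q), dot(q, Hq)
    n = len(p)
    return [[H[i][j] + p[i] * p[j] / pq - Hq[i] * Hq[j] / qHq
             for j in range(n)] for i in range(n)]

def dfp_quadratic(F, b, x, H):
    """Run n DFP steps; return the final point, final H, and the steps p_k."""
    g = [Fx - bi for Fx, bi in zip(matvec(F, x), b)]
    steps = []
    for _ in range(len(x)):
        d = [-v for v in matvec(H, g)]
        alpha = -dot(g, d) / dot(d, matvec(F, d))   # exact minimizer along d
        p = [alpha * v for v in d]
        x = [xi + pi for xi, pi in zip(x, p)]
        g_new = [Fx - bi for Fx, bi in zip(matvec(F, x), b)]
        q = [a - c for a, c in zip(g_new, g)]
        H = dfp_update(H, p, q)
        g = g_new
        steps.append(p)
    return x, H, steps

F = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]   # positive definite
b = [1.0, 2.0, 3.0]
H0 = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
x_final, H_final, steps = dfp_quadratic(F, b, [0.0, 0.0, 0.0], H0)

# H_n F should be numerically the identity, and the steps F-orthogonal.
HF = [[dot(row, col) for col in zip(*F)] for row in H_final]
conj = [dot(steps[i], matvec(F, steps[j]))
        for i in range(3) for j in range(i + 1, 3)]
print(HF)
print(conj)
```

Up to rounding, `HF` is the identity matrix and every entry of `conj` is zero, in line with (10.22) and (10.23).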


10.4  The  Broyden  Family 

The updating formulae for the inverse Hessian considered in the previous two sections
are based on satisfying

$$\mathbf{H}_{k+1}\mathbf{q}_i = \mathbf{p}_i, \quad 0 \le i \le k, \tag{10.29}$$

which is derived from the relation

$$\mathbf{q}_i = \mathbf{F}\mathbf{p}_i, \quad 0 \le i \le k, \tag{10.30}$$

which would hold in the purely quadratic case. It is also possible to update
approximations to the Hessian $\mathbf{F}$ itself, rather than its inverse. Thus, denoting the $k$th
approximation of $\mathbf{F}$ by $\mathbf{B}_k$, we would, analogously, seek to satisfy

$$\mathbf{q}_i = \mathbf{B}_{k+1}\mathbf{p}_i, \quad 0 \le i \le k. \tag{10.31}$$

Equation (10.31) has exactly the same form as (10.29) except that $\mathbf{q}_i$ and $\mathbf{p}_i$ are
interchanged and $\mathbf{H}$ is replaced by $\mathbf{B}$. It should be clear that this implies that any
update formula for $\mathbf{H}$ derived to satisfy (10.29) can be transformed into a corresponding
update formula for $\mathbf{B}$. Specifically, given any update formula for $\mathbf{H}$, the
complementary formula is found by interchanging the roles of $\mathbf{B}$ and $\mathbf{H}$ and of $\mathbf{q}$
and $\mathbf{p}$. Likewise, any updating formula for $\mathbf{B}$ that satisfies (10.31) can be converted
by the same process to a complementary formula for updating $\mathbf{H}$. It is easily seen
that taking the complement of a complement restores the original formula.

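To illustrate complementary formulae, consider the rank one update of Sect. 10.2,
which is

$$\mathbf{H}_{k+1} = \mathbf{H}_k + \frac{(\mathbf{p}_k - \mathbf{H}_k\mathbf{q}_k)(\mathbf{p}_k - \mathbf{H}_k\mathbf{q}_k)^T}{\mathbf{q}_k^T(\mathbf{p}_k - \mathbf{H}_k\mathbf{q}_k)}. \tag{10.32}$$

The corresponding complementary formula is

$$\mathbf{B}_{k+1} = \mathbf{B}_k + \frac{(\mathbf{q}_k - \mathbf{B}_k\mathbf{p}_k)(\mathbf{q}_k - \mathbf{B}_k\mathbf{p}_k)^T}{\mathbf{p}_k^T(\mathbf{q}_k - \mathbf{B}_k\mathbf{p}_k)}. \tag{10.33}$$

Likewise, the Davidon-Fletcher-Powell (or simply DFP) formula is

$$\mathbf{H}^{\mathrm{DFP}}_{k+1} = \mathbf{H}_k + \frac{\mathbf{p}_k\mathbf{p}_k^T}{\mathbf{p}_k^T\mathbf{q}_k} - \frac{\mathbf{H}_k\mathbf{q}_k\mathbf{q}_k^T\mathbf{H}_k}{\mathbf{q}_k^T\mathbf{H}_k\mathbf{q}_k} \tag{10.34}$$

and its complement is

$$\mathbf{B}_{k+1} = \mathbf{B}_k + \frac{\mathbf{q}_k\mathbf{q}_k^T}{\mathbf{q}_k^T\mathbf{p}_k} - \frac{\mathbf{B}_k\mathbf{p}_k\mathbf{p}_k^T\mathbf{B}_k}{\mathbf{p}_k^T\mathbf{B}_k\mathbf{p}_k}. \tag{10.35}$$

This last update is known as the Broyden-Fletcher-Goldfarb-Shanno update of $\mathbf{B}_k$,
and it plays an important role in what follows.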

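Another way to convert an updating formula for $\mathbf{H}$ to one for $\mathbf{B}$ or vice versa is
to take the inverse. Clearly, if

$$\mathbf{H}_{k+1}\mathbf{q}_i = \mathbf{p}_i, \quad 0 \le i \le k, \tag{10.36}$$

then

$$\mathbf{q}_i = \mathbf{H}_{k+1}^{-1}\mathbf{p}_i, \quad 0 \le i \le k, \tag{10.37}$$

which implies that $\mathbf{H}_{k+1}^{-1}$ satisfies (10.31), the criterion for an update of $\mathbf{B}$. Also, most
importantly, the inverse of a rank two formula is itself a rank two formula.

The new formula can be found explicitly by two applications of the general inversion
identity (often referred to as the Sherman-Morrison formula)

$$[\mathbf{A} + \mathbf{a}\mathbf{b}^T]^{-1} = \mathbf{A}^{-1} - \frac{\mathbf{A}^{-1}\mathbf{a}\mathbf{b}^T\mathbf{A}^{-1}}{1 + \mathbf{b}^T\mathbf{A}^{-1}\mathbf{a}}, \tag{10.38}$$

where $\mathbf{A}$ is an $n \times n$ matrix, and $\mathbf{a}$ and $\mathbf{b}$ are $n$-vectors, which is valid provided the
inverses exist. (This is easily verified by multiplying through by $\mathbf{A} + \mathbf{a}\mathbf{b}^T$.)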
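
As a quick sanity check of (10.38), the following sketch compares the formula with a directly computed inverse for a 2 x 2 case. The matrix `A` and the vectors `a` and `b` are arbitrary illustrative values, and `inv2` and `sherman_morrison` are names made up for this example.

```python
def inv2(M):
    """Inverse of a 2 x 2 matrix by the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def sherman_morrison(A_inv, a, b):
    """Inverse of A + a b^T computed from A^{-1} as in (10.38)."""
    n = len(a)
    Aa = [sum(A_inv[i][j] * a[j] for j in range(n)) for i in range(n)]   # A^{-1} a
    bA = [sum(b[i] * A_inv[i][j] for i in range(n)) for j in range(n)]   # b^T A^{-1}
    denom = 1.0 + sum(bi * x for bi, x in zip(b, Aa))
    return [[A_inv[i][j] - Aa[i] * bA[j] / denom for j in range(n)]
            for i in range(n)]

A = [[2.0, 1.0], [1.0, 3.0]]
a, b = [1.0, 2.0], [0.5, -1.0]

direct = inv2([[A[i][j] + a[i] * b[j] for j in range(2)] for i in range(2)])
via_formula = sherman_morrison(inv2(A), a, b)
print(direct)
print(via_formula)
```

The two printed matrices agree to rounding error. The formula needs only matrix-vector products once $\mathbf{A}^{-1}$ is known, which is why a rank two update of $\mathbf{B}$ can be inverted cheaply by applying it twice.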

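The Broyden-Fletcher-Goldfarb-Shanno update for $\mathbf{B}$ produces, by taking the
inverse, a corresponding update for $\mathbf{H}$ of the form

$$\mathbf{H}^{\mathrm{BFGS}}_{k+1} = \mathbf{H}_k + \left(1 + \frac{\mathbf{q}_k^T\mathbf{H}_k\mathbf{q}_k}{\mathbf{q}_k^T\mathbf{p}_k}\right)\frac{\mathbf{p}_k\mathbf{p}_k^T}{\mathbf{p}_k^T\mathbf{q}_k} - \frac{\mathbf{p}_k\mathbf{q}_k^T\mathbf{H}_k + \mathbf{H}_k\mathbf{q}_k\mathbf{p}_k^T}{\mathbf{q}_k^T\mathbf{p}_k}. \tag{10.39}$$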


This  is  an  important  update  formula  that  can  be  used  exactly  like  the  DFP  formula. 
Numerical  experiments  have  repeatedly  indicated  that  its  performance  is  superior  to 
that  of  the  DFP  formula,  and  for  this  reason  it  is  now  generally  preferred. 
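
The relation between (10.35) and (10.39) can be confirmed numerically. In the sketch below the 2 x 2 matrix `H` and the vectors `p` and `q` are arbitrary illustrative values chosen so that $\mathbf{p}^T\mathbf{q} > 0$; the function names are assumptions of this example, not part of the text.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def inv2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def bfgs_inverse_update(H, p, q):
    """Formula (10.39): BFGS update of the inverse approximation H (H symmetric)."""
    Hq = matvec(H, q)
    pq, qHq = dot(p, q), dot(q, Hq)
    c = (1.0 + qHq / pq) / pq
    n = len(p)
    return [[H[i][j] + c * p[i] * p[j] - (p[i] * Hq[j] + Hq[i] * p[j]) / pq
             for j in range(n)] for i in range(n)]

def bfgs_direct_update(B, p, q):
    """Formula (10.35): BFGS update of the Hessian approximation B."""
    Bp = matvec(B, p)
    qp, pBp = dot(q, p), dot(p, Bp)
    n = len(p)
    return [[B[i][j] + q[i] * q[j] / qp - Bp[i] * Bp[j] / pBp
             for j in range(n)] for i in range(n)]

H = [[2.0, 0.5], [0.5, 1.0]]        # symmetric positive definite
p, q = [1.0, 0.5], [0.8, 1.2]       # p^T q = 1.4 > 0

H_new = bfgs_inverse_update(H, p, q)
H_via_B = inv2(bfgs_direct_update(inv2(H), p, q))
secant_residual = [a - b for a, b in zip(matvec(H_new, q), p)]
print(H_new)
print(H_via_B)
print(secant_residual)
```

Both routes give the same matrix, and the residual of $\mathbf{H}_{k+1}\mathbf{q}_k = \mathbf{p}_k$ vanishes to rounding error.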

It can be noted that both the DFP and the BFGS updates have symmetric rank
two corrections that are constructed from the vectors $\mathbf{p}_k$ and $\mathbf{H}_k\mathbf{q}_k$. Weighted
combinations of these formulae will therefore also be of this same type (symmetric, rank
two, and constructed from $\mathbf{p}_k$ and $\mathbf{H}_k\mathbf{q}_k$). This observation naturally leads to
consideration of a whole collection of updates, known as the Broyden family, defined by

$$\mathbf{H}^{\phi} = (1 - \phi)\mathbf{H}^{\mathrm{DFP}} + \phi\mathbf{H}^{\mathrm{BFGS}}, \tag{10.40}$$

where $\phi$ is a parameter that may take any real value. Clearly $\phi = 0$ and $\phi = 1$ yield
the DFP and BFGS updates, respectively. The Broyden family also includes the rank
one update (see Exercise 12).

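An explicit representation of the Broyden family can be found, after a fair amount
of algebra, to be

$$\mathbf{H}^{\phi}_{k+1} = \mathbf{H}_k + \frac{\mathbf{p}_k\mathbf{p}_k^T}{\mathbf{p}_k^T\mathbf{q}_k} - \frac{\mathbf{H}_k\mathbf{q}_k\mathbf{q}_k^T\mathbf{H}_k}{\mathbf{q}_k^T\mathbf{H}_k\mathbf{q}_k} + \phi\mathbf{v}_k\mathbf{v}_k^T = \mathbf{H}^{\mathrm{DFP}}_{k+1} + \phi\mathbf{v}_k\mathbf{v}_k^T, \tag{10.41}$$

where

$$\mathbf{v}_k = (\mathbf{q}_k^T\mathbf{H}_k\mathbf{q}_k)^{1/2}\left(\frac{\mathbf{p}_k}{\mathbf{p}_k^T\mathbf{q}_k} - \frac{\mathbf{H}_k\mathbf{q}_k}{\mathbf{q}_k^T\mathbf{H}_k\mathbf{q}_k}\right).$$
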

This  form  will  be  useful  in  some  later  developments. 
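
The equivalence of the weighted combination (10.40) and the explicit form (10.41) can be checked numerically. This is a sketch under assumed data: the 3 x 3 matrix `H` and the vectors `p` and `q` are arbitrary, with `H` positive definite and $\mathbf{p}^T\mathbf{q} > 0$, and the function names are hypothetical.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dfp(H, p, q):
    """DFP update (10.34)."""
    Hq = matvec(H, q)
    pq, qHq = dot(p, q), dot(q, Hq)
    n = len(p)
    return [[H[i][j] + p[i] * p[j] / pq - Hq[i] * Hq[j] / qHq
             for j in range(n)] for i in range(n)]

def bfgs(H, p, q):
    """BFGS update of H (10.39)."""
    Hq = matvec(H, q)
    pq, qHq = dot(p, q), dot(q, Hq)
    c = (1.0 + qHq / pq) / pq
    n = len(p)
    return [[H[i][j] + c * p[i] * p[j] - (p[i] * Hq[j] + Hq[i] * p[j]) / pq
             for j in range(n)] for i in range(n)]

def broyden(H, p, q, phi):
    """Explicit Broyden form (10.41): DFP update plus phi * v v^T."""
    Hq = matvec(H, q)
    pq, qHq = dot(p, q), dot(q, Hq)
    v = [qHq ** 0.5 * (pi / pq - hi / qHq) for pi, hi in zip(p, Hq)]
    D = dfp(H, p, q)
    n = len(p)
    return [[D[i][j] + phi * v[i] * v[j] for j in range(n)] for i in range(n)]

H = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.5]]
p, q = [1.0, 0.5, -0.3], [0.8, 1.2, 0.1]

max_gap = 0.0
for phi in (0.0, 0.3, 1.0, 2.5):
    explicit = broyden(H, p, q, phi)
    D, B = dfp(H, p, q), bfgs(H, p, q)
    for i in range(3):
        for j in range(3):
            combo = (1.0 - phi) * D[i][j] + phi * B[i][j]
            max_gap = max(max_gap, abs(explicit[i][j] - combo))
print(max_gap)
```

The printed gap is at rounding level for every tested $\phi$, including a value outside $[0, 1]$.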

A Broyden method is defined as a quasi-Newton method in which at each iteration
a member of the Broyden family is used as the updating formula. The parameter $\phi$
is, in general, allowed to vary from one iteration to another, so a particular Broyden
method is defined by a sequence $\phi_1, \phi_2, \ldots$ of parameter values. A pure Broyden
method is one that uses a constant $\phi$.

Since both $\mathbf{H}^{\mathrm{DFP}}$ and $\mathbf{H}^{\mathrm{BFGS}}$ satisfy the fundamental relation (10.29) for updates,
this relation is also satisfied by all members of the Broyden family. Thus it can be
expected that many properties that were found to hold for the DFP method will
also hold for any Broyden method, and indeed this is so. The following is a direct
extension of the theorem of Sect. 10.3.


Theorem. If $f$ is quadratic with positive definite Hessian $\mathbf{F}$, then for a Broyden method

$$\mathbf{p}_i^T\mathbf{F}\mathbf{p}_j = 0, \quad 0 \le i < j \le k$$

$$\mathbf{H}_{k+1}\mathbf{F}\mathbf{p}_i = \mathbf{p}_i \quad \text{for } 0 \le i \le k.$$


Proof. The proof parallels that of Sect. 10.3, since the results depend only on the
basic relation (10.29) and the orthogonality (10.19) because of exact line search. ∎

The Broyden family does not necessarily preserve positive definiteness of $\mathbf{H}^{\phi}$
for all values of $\phi$. However, we know that the DFP method does preserve positive
definiteness. Hence from (10.41) it follows that positive definiteness is preserved
for any $\phi \ge 0$, since the sum of a positive definite matrix and a positive semidefinite
matrix is positive definite. For $\phi < 0$ there is the possibility that $\mathbf{H}^{\phi}$ may become
singular, and thus special precautions should be introduced. In practice $\phi \ge 0$ is
usually imposed to avoid difficulties.

There has been considerable experimentation with Broyden methods to determine
superior strategies for selecting the sequence of parameters $\phi_k$.

The above theorem shows that the choice is irrelevant in the case of a quadratic
objective and accurate line search. More surprisingly, it has been shown that even for
the case of nonquadratic functions and accurate line searches, the points generated
by all Broyden methods will coincide (provided singularities are avoided and
multiple minima are resolved consistently). This means that differences in methods
are important only with inaccurate line search.

For general nonquadratic functions of modest dimension, Broyden methods seem
to offer a combination of advantages as attractive general procedures. First, they
require only that first-order (that is, gradient) information be available. Second, the
directions generated can always be guaranteed to be directions of descent by arranging
for $\mathbf{H}_k$ to be positive definite throughout the process. Third, since for a quadratic
problem the matrices $\mathbf{H}_k$ converge to the inverse Hessian in at most $n$ steps, it might
be argued that in the general case $\mathbf{H}_k$ will converge to the inverse Hessian at the
solution, and hence convergence will be superlinear. Unfortunately, while the methods
are certainly excellent, their convergence characteristics require more careful
analysis, and this will lead us to an important additional modification.


Partial  Quasi-Newton  Methods 

There is, of course, the option of restarting a Broyden method every $m + 1$ steps,
where $m + 1 < n$. This would yield a partial quasi-Newton method that, for small
values of $m$, would have modest storage requirements, since the approximate inverse
Hessian could be stored implicitly by storing only the vectors $\mathbf{p}_i$ and $\mathbf{q}_i$, $i \le m + 1$. In
the quadratic case this method exactly corresponds to the partial conjugate gradient
method and hence it has similar convergence properties.


10.5  Convergence  Properties 

The  various  schemes  for  simultaneously  generating  and  using  an  approximation  to 
the  inverse  Hessian  are  difficult  to  analyze  definitively.  One  must  therefore,  to  some 
extent,  resort  to  the  use  of  analogy  and  approximate  analyses  to  determine  their 
effectiveness.  Nevertheless,  the  machinery  we  developed  earlier  provides  a  basis  for 
at  least  a  preliminary  analysis. 


Global  Convergence 

In  practice,  quasi-Newton  methods  are  usually  executed  in  a  continuing  fashion, 
starting  with  an  initial  approximation  and  successively  improving  it  throughout  the 
iterative  process.  Under  various  and  somewhat  stringent  conditions,  it  can  be  proved 
that  this  procedure  is  globally  convergent.  If,  on  the  other  hand,  the  quasi-Newton 
methods  are  restarted  every  n  or  n  +  1  steps  by  resetting  the  approximate  inverse 
Hessian  to  its  initial  value,  then  global  convergence  is  guaranteed  by  the  presence 
of the first descent step of each cycle (which acts as a spacer step).

Local Convergence

The local convergence properties of quasi-Newton methods in the pure form discussed
so far are not as good as might first be thought. Let us focus on the local
convergence properties of these methods when executed with the restarting feature.
Specifically, consider a Broyden method and for simplicity assume that at the beginning
of each cycle the approximate inverse Hessian is reset to the identity matrix.
Each cycle, if at least $n$ steps in duration, will then contain one complete cycle of an
approximation to the conjugate gradient method. Asymptotically, in the tail of the
generated sequence, this approximation becomes arbitrarily accurate, and hence we
may conclude, as for any method that asymptotically approaches the conjugate gradient
method, that the method converges superlinearly (at least if viewed at the end
of each cycle). Although superlinear convergence is attractive, the fact that in this
case it hinges on repeated cycles of $n$ steps in duration can seriously detract from its
practical significance for problems with large $n$, since we might hope to terminate
the procedure before completing even a single full cycle of $n$ steps.

To obtain insight into the defects of the method, let us consider a special situation.
Suppose that $f$ is quadratic and that the eigenvalues of the Hessian, $\mathbf{F}$, of $f$ are close
together but all very large. If, starting with the identity matrix, an approximation
to the inverse Hessian is updated $m$ times, the matrix $\mathbf{H}_m\mathbf{F}$ will have $m$ eigenvalues
equal to unity and the rest will still be large. Thus, the ratio of smallest to largest
eigenvalue of $\mathbf{H}_m\mathbf{F}$, the condition number, will be worse than for $\mathbf{F}$ itself. Therefore,
if the updating were discontinued and $\mathbf{H}_m$ were used as the approximation to $\mathbf{F}^{-1}$ in
future iterations according to the procedure of Sect. 10.1, we see that convergence
would be poorer than it would be for ordinary steepest descent. In other words, the
approximations to $\mathbf{F}^{-1}$ generated by the updating formulas, although accurate over
the subspace traveled, do not necessarily improve and, indeed, are likely to worsen
the eigenvalue structure of the iteration process.
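
A rough calculation makes the point concrete. The sketch below assumes the usual per-step bound $((A - a)/(A + a))^2$ for steepest-descent-type iterations, where $a$ and $A$ are the smallest and largest eigenvalues of the relevant matrix, and it uses illustrative eigenvalues between 30 and 40. The function name `rate_bound` is made up for this example.

```python
def rate_bound(smallest, largest):
    """Per-step bound ((A - a)/(A + a))^2 on the reduction of the error in f."""
    return ((largest - smallest) / (largest + smallest)) ** 2

plain = rate_bound(30.0, 40.0)        # F itself: eigenvalues between 30 and 40
partial_lo = rate_bound(1.0, 30.0)    # H_m F: one eigenvalue at 1, largest near 30
partial_hi = rate_bound(1.0, 40.0)    # H_m F: one eigenvalue at 1, largest near 40
print(plain, partial_lo, partial_hi)
```

With these numbers the bound for $\mathbf{F}$ is about 0.02 per step, while for a partially updated $\mathbf{H}_m\mathbf{F}$ it lies roughly between 0.88 and 0.90, so the guaranteed progress per step is far smaller.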

In practice a poor eigenvalue structure arising in this manner will play a dominating
role whenever there are factors that tend to weaken its approximation to the
conjugate gradient method. Common factors of this type are round-off errors,
inaccurate line searches, and nonquadratic terms in the objective function. Indeed, it
has been frequently observed, empirically, that performance of the DFP method is
highly sensitive to the accuracy of the line search algorithm, to the point where
superior step-wise convergence properties can only be obtained through excessive
time expenditure in the line search phase.


Example. To illustrate some of these conclusions we consider the six-dimensional
problem defined by

$$f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x},$$

where

$$\mathbf{Q} = \begin{bmatrix}
40 & 0 & 0 & 0 & 0 & 0 \\
0 & 38 & 0 & 0 & 0 & 0 \\
0 & 0 & 36 & 0 & 0 & 0 \\
0 & 0 & 0 & 34 & 0 & 0 \\
0 & 0 & 0 & 0 & 32 & 0 \\
0 & 0 & 0 & 0 & 0 & 30
\end{bmatrix}.$$

This function was minimized iteratively (the solution is obviously $\mathbf{x}^* = \mathbf{0}$) starting
at $\mathbf{x}_0 = (10, 10, 10, 10, 10, 10)$, with $f(\mathbf{x}_0) = 10{,}500$, by using, alternatively, the
method of steepest descent, the DFP method, the DFP method restarted every six
steps, and the self-scaling method described in the next section. For this quadratic
problem the appropriate step size to take at any stage can be calculated by a simple
formula. On different computer runs of a given method, different levels of error were
deliberately introduced into the step size in order to observe the effect of line search
accuracy. This error took the form of a fixed percentage increase over the optimal
value. The results are presented below:


Case 1. No error in step size α

Function value

Iteration   Steepest descent    DFP                 DFP (with restart)   Self-scaling
1           96.29630            96.29630            96.29630             96.29630
2           1.560669            6.900839 × 10^-1    6.900839 × 10^-1     6.900839 × 10^-1
3           2.932559 × 10^-2    3.988497 × 10^-3    3.988497 × 10^-3     3.988497 × 10^-3
4           5.787315 × 10^-4    1.683310 × 10^-5    1.683310 × 10^-5     1.683310 × 10^-5
5           1.164595 × 10^-5    3.878639 × 10^-8    3.878639 × 10^-8     3.878639 × 10^-8
6           2.359563 × 10^-7


Case 2. 0.1 % error in step size α

Function value

Iteration   Steepest descent    DFP                 DFP (with restart)   Self-scaling
1           96.30669            96.30669            96.30669             96.30669
2           1.564971            6.994023 × 10^-1    6.994023 × 10^-1     6.902072 × 10^-1
3           2.939804 × 10^-2    1.225501 × 10^-2    1.225501 × 10^-2     3.989507 × 10^-3
4           5.810123 × 10^-4    7.301088 × 10^-3    7.301088 × 10^-3     1.684263 × 10^-5
5           1.169205 × 10^-5    2.636716 × 10^-3    2.636716 × 10^-3     3.881674 × 10^-8
6           2.372385 × 10^-7    1.031086 × 10^-5    1.031086 × 10^-5
7                               3.633330 × 10^-9    2.399278 × 10^-8


Case 3. 1 % error in step size α

Function value

Iteration   Steepest descent    DFP                 DFP (with restart)   Self-scaling
1           97.33665            97.33665            97.33665             97.33665
2           1.586251            1.621908            1.621908             0.7024872
3           2.989875 × 10^-2    8.268893 × 10^-1    8.268893 × 10^-1     4.090350 × 10^-3
4           5.908101 × 10^-4    4.302943 × 10^-1    4.302943 × 10^-1     1.779424 × 10^-5
5           1.194144 × 10^-5    4.449852 × 10^-3    4.449852 × 10^-3     4.195668 × 10^-8
6           2.422985 × 10^-7    5.337835 × 10^-5    5.337835 × 10^-5
7                               3.767830 × 10^-5    4.493397 × 10^-7
8                               3.768097 × 10^-9


Case 4. 10 % error in step size α

Function value

Iteration   Steepest descent    DFP                 DFP (with restart)   Self-scaling
1           200.333             200.333             200.333              200.333
2           2.732789            93.65457            93.65457             2.811061
3           3.836899 × 10^-2    56.92999            56.92999             3.562769 × 10^-2
4           6.376461 × 10^-4    1.620688            1.620688             4.200600 × 10^-4
5           1.219515 × 10^-5    5.251115 × 10^-1    5.251115 × 10^-1     4.726918 × 10^-6
6           2.457944 × 10^-7    3.323745 × 10^-1    3.323745 × 10^-1
7                               6.150890 × 10^-3    8.102700 × 10^-3
8                               3.025393 × 10^-3    2.973021 × 10^-3
9                               3.025476 × 10^-5    1.950152 × 10^-3
10                              3.025476 × 10^-7    2.769299 × 10^-5
11                                                  1.760320 × 10^-5
12                                                  1.123844 × 10^-6


We  note  first  that  the  error  introduced  is  reported  as  a  percentage  of  the  step 
size  itself.  In  terms  of  the  change  in  function  value,  the  quantity  that  is  most  often 
monitored  to  determine  when  to  terminate  a  line  search,  the  fractional  error  is  the 
square  of  that  in  the  step  size.  Thus,  a  one  percent  error  in  step  size  is  equivalent  to 
a  0.01  %  error  in  the  change  in  function  value. 
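
The square relationship, and the first-iteration values of the steepest descent column, can be reproduced with a few lines. This is a minimal sketch that assumes the example objective is $f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x}$ with the diagonal $\mathbf{Q}$ listed in the example (consistent with the stated value of 10,500 at the starting point); the function names are illustrative only.

```python
Q = [40.0, 38.0, 36.0, 34.0, 32.0, 30.0]   # diagonal of the example matrix Q
x0 = [10.0] * 6

def f(x):
    return 0.5 * sum(qi * xi * xi for qi, xi in zip(Q, x))

def steepest_descent_step(x, err):
    """One steepest descent step whose length is (1 + err) times the optimum."""
    g = [qi * xi for qi, xi in zip(Q, x)]
    alpha_opt = sum(gi * gi for gi in g) / sum(qi * gi * gi for qi, gi in zip(Q, g))
    alpha = (1.0 + err) * alpha_opt
    return [xi - alpha * gi for xi, gi in zip(x, g)]

f0 = f(x0)
best_decrease = f0 - f(steepest_descent_step(x0, 0.0))
results = {}
for err in (0.0, 0.001, 0.01, 0.10):
    f1 = f(steepest_descent_step(x0, err))
    lost = (best_decrease - (f0 - f1)) / best_decrease   # fractional loss in decrease
    results[err] = (f1, lost)
    print(err, f1, lost)
```

For each error level the fractional loss in the decrease of $f$ equals the square of the step-size error, and the values of `f1` match the first rows of the four tables to the digits shown there (up to the precision of the original runs).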

Next we note that the method of steepest descent is not radically affected by an
inaccurate line search while the DFP methods are. Thus for this example while DFP
is superior to steepest descent in the case of perfect accuracy, it becomes inferior at
an error of only 0.1 % in step size.


10.6 Scaling

There is a general viewpoint about what makes up a desirable descent method that
underlies much of our earlier discussions and which we now summarize briefly in
order to motivate the presentation of scaling. A method that converges to the exact
solution after $n$ steps when applied to a quadratic function on $E^n$ has obvious appeal
especially if, as is usually the case, it can be inferred that for nonquadratic problems
repeated cycles of length $n$ of the method will yield superlinear convergence. For
problems having large $n$, however, a more sophisticated criterion of performance
needs to be established, since for such problems one usually hopes to be able to
terminate the descent process before completing even a single full cycle of length
$n$. Thus, with these sorts of problems in mind, the finite-step convergence property
serves at best only as a signpost indicating that the algorithm might make rapid
progress in its early stages. It is essential to ensure that in fact it will make rapid
progress at every stage. Furthermore, the rapid convergence at each step must not
be tied to an assumption on conjugate directions, a property easily destroyed by
inaccurate line search and nonquadratic objective functions. With this viewpoint
it is natural to look for quasi-Newton methods that simultaneously possess favorable
eigenvalue structure at each step (in the sense of Sect. 10.1) and reduce to the
conjugate gradient method if the objective function happens to be quadratic. Such
methods are developed in this section.


Improvement  of  Eigenvalue  Ratio 

Referring to the example presented in the last section where the Davidon-Fletcher-Powell
method performed poorly, we can trace the difficulty to the simple observation
that the eigenvalues of $\mathbf{H}_0\mathbf{Q}$ are all much larger than unity. The DFP algorithm,
or any Broyden method, essentially moves these eigenvalues, one at a time, to unity
thereby producing an unfavorable eigenvalue ratio in each $\mathbf{H}_k\mathbf{Q}$ for $1 \le k < n$. This
phenomenon can be attributed to the fact that the methods are sensitive to simple
scale factors. In particular if $\mathbf{H}_0$ were multiplied by a constant, the whole process
would be different. In the example of the last section, if $\mathbf{H}_0$ were scaled by, for
instance, multiplying it by 1/35, the eigenvalues of $\mathbf{H}_0\mathbf{Q}$ would be spread above and
below unity, and in that case one might suspect that the poor performance would not
show up.

Motivated by the above considerations, we shall establish conditions under which
the eigenvalue ratio of $\mathbf{H}_{k+1}\mathbf{F}$ is at least as favorable as that of $\mathbf{H}_k\mathbf{F}$ in a Broyden
method. These conditions will then be used as a basis for introducing appropriate
scale factors.

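We use (but do not prove) the following matrix theoretic result due to Loewner.

Interlocking Eigenvalues Lemma. Let the symmetric $n \times n$ matrix $\mathbf{A}$ have eigenvalues
$\lambda_1 \le \lambda_2 \le \ldots \le \lambda_n$. Let $\mathbf{a}$ be any vector in $E^n$ and denote the eigenvalues of
the matrix $\mathbf{A} + \mathbf{a}\mathbf{a}^T$ by $\mu_1 \le \mu_2 \le \ldots \le \mu_n$. Then
$\lambda_1 \le \mu_1 \le \lambda_2 \le \mu_2 \le \ldots \le \lambda_n \le \mu_n$.

For convenience we introduce the following definitions:

$$\mathbf{R}_k = \mathbf{F}^{1/2}\mathbf{H}_k\mathbf{F}^{1/2}$$

$$\mathbf{r}_k = \mathbf{F}^{1/2}\mathbf{p}_k.$$

Then using $\mathbf{q}_k = \mathbf{F}^{1/2}\mathbf{r}_k$, it can be readily verified that (10.41) is equivalent to

$$\mathbf{R}^{\phi}_{k+1} = \mathbf{R}_k - \frac{\mathbf{R}_k\mathbf{r}_k\mathbf{r}_k^T\mathbf{R}_k}{\mathbf{r}_k^T\mathbf{R}_k\mathbf{r}_k} + \frac{\mathbf{r}_k\mathbf{r}_k^T}{\mathbf{r}_k^T\mathbf{r}_k} + \phi\mathbf{z}_k\mathbf{z}_k^T, \tag{10.42}$$

where

$$\mathbf{z}_k = \mathbf{F}^{1/2}\mathbf{v}_k = (\mathbf{r}_k^T\mathbf{R}_k\mathbf{r}_k)^{1/2}\left(\frac{\mathbf{r}_k}{\mathbf{r}_k^T\mathbf{r}_k} - \frac{\mathbf{R}_k\mathbf{r}_k}{\mathbf{r}_k^T\mathbf{R}_k\mathbf{r}_k}\right).$$

Since $\mathbf{R}_k$ is similar to $\mathbf{H}_k\mathbf{F}$ (because $\mathbf{H}_k\mathbf{F} = \mathbf{F}^{-1/2}\mathbf{R}_k\mathbf{F}^{1/2}$), both have the same
eigenvalues. It is most convenient, however, in view of (10.42) to study $\mathbf{R}_k$, obtaining
conclusions about $\mathbf{H}_k\mathbf{F}$ indirectly.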

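Before proving the general theorem we shall consider the case $\phi = 0$ corresponding
to the DFP formula. Suppose the eigenvalues of $\mathbf{R}_k$ are $\lambda_1, \lambda_2, \ldots, \lambda_n$ with
$0 < \lambda_1 \le \lambda_2 \le \ldots \le \lambda_n$. Suppose also that $1 \in [\lambda_1, \lambda_n]$. We will show that the
eigenvalues of $\mathbf{R}_{k+1}$ are all contained in the interval $[\lambda_1, \lambda_n]$, which of course implies that
$\mathbf{R}_{k+1}$ is no worse than $\mathbf{R}_k$ in terms of its condition number. Let us first consider the
matrix

$$\mathbf{P} = \mathbf{R}_k - \frac{\mathbf{R}_k\mathbf{r}_k\mathbf{r}_k^T\mathbf{R}_k}{\mathbf{r}_k^T\mathbf{R}_k\mathbf{r}_k}.$$

We see that $\mathbf{P}\mathbf{r}_k = \mathbf{0}$ so one eigenvalue of $\mathbf{P}$ is zero. If we denote the eigenvalues of
$\mathbf{P}$ by $\mu_1 \le \mu_2 \le \ldots \le \mu_n$, we have from the above observation and the lemma on
interlocking eigenvalues that

$$0 = \mu_1 \le \lambda_1 \le \mu_2 \le \ldots \le \mu_n \le \lambda_n.$$

Next we consider

$$\mathbf{R}_{k+1} = \mathbf{R}_k - \frac{\mathbf{R}_k\mathbf{r}_k\mathbf{r}_k^T\mathbf{R}_k}{\mathbf{r}_k^T\mathbf{R}_k\mathbf{r}_k} + \frac{\mathbf{r}_k\mathbf{r}_k^T}{\mathbf{r}_k^T\mathbf{r}_k} = \mathbf{P} + \frac{\mathbf{r}_k\mathbf{r}_k^T}{\mathbf{r}_k^T\mathbf{r}_k}. \tag{10.43}$$

Since $\mathbf{r}_k$ is an eigenvector of $\mathbf{P}$ and since, by symmetry, all other eigenvectors of $\mathbf{P}$
are therefore orthogonal to $\mathbf{r}_k$, it follows that the only eigenvalue different in $\mathbf{R}_{k+1}$
from in $\mathbf{P}$ is the one corresponding to $\mathbf{r}_k$, which is now unity. Thus $\mathbf{R}_{k+1}$ has
eigenvalues $\mu_2, \mu_3, \ldots, \mu_n$ and unity. These are all contained in the interval $[\lambda_1, \lambda_n]$.
Thus updating does not worsen the eigenvalue ratio. It should be noted that this
result in no way depends on $\alpha_k$ being selected to minimize $f$.

We now extend the above to the Broyden class with $0 \le \phi \le 1$.

Theorem. Let the $n$ eigenvalues of $\mathbf{H}_k\mathbf{F}$ be $\lambda_1, \lambda_2, \ldots, \lambda_n$ with $0 < \lambda_1 \le \lambda_2 \le \ldots \le \lambda_n$.
Suppose that $1 \in [\lambda_1, \lambda_n]$. Then for any $\phi$, $0 \le \phi \le 1$, the eigenvalues of $\mathbf{H}^{\phi}_{k+1}\mathbf{F}$, where
$\mathbf{H}^{\phi}_{k+1}$ is defined by (10.41), are all contained in $[\lambda_1, \lambda_n]$.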


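Proof. The result shown above corresponds to $\phi = 0$. Let us now consider $\phi = 1$,
corresponding to the BFGS formula. By our original definition of the BFGS update,
$[\mathbf{H}^{1}_{k+1}]^{-1}$ is defined by the formula that is complementary to the DFP formula. Thus

$$[\mathbf{H}^{1}_{k+1}]^{-1} = \mathbf{H}_k^{-1} + \frac{\mathbf{q}_k\mathbf{q}_k^T}{\mathbf{q}_k^T\mathbf{p}_k} - \frac{\mathbf{H}_k^{-1}\mathbf{p}_k\mathbf{p}_k^T\mathbf{H}_k^{-1}}{\mathbf{p}_k^T\mathbf{H}_k^{-1}\mathbf{p}_k}.$$

This is equivalent to

$$[\mathbf{R}^{1}_{k+1}]^{-1} = \mathbf{R}_k^{-1} - \frac{\mathbf{R}_k^{-1}\mathbf{r}_k\mathbf{r}_k^T\mathbf{R}_k^{-1}}{\mathbf{r}_k^T\mathbf{R}_k^{-1}\mathbf{r}_k} + \frac{\mathbf{r}_k\mathbf{r}_k^T}{\mathbf{r}_k^T\mathbf{r}_k}, \tag{10.44}$$

which is identical to (10.43) except that $\mathbf{R}_k$ is replaced by $\mathbf{R}_k^{-1}$.

The eigenvalues of $\mathbf{R}_k^{-1}$ are $1/\lambda_n \le 1/\lambda_{n-1} \le \ldots \le 1/\lambda_1$. Clearly,
$1 \in [1/\lambda_n, 1/\lambda_1]$. Thus by the preliminary result, if the eigenvalues of $[\mathbf{R}^{1}_{k+1}]^{-1}$ are
denoted $1/\mu_n \le 1/\mu_{n-1} \le \ldots \le 1/\mu_1$, it follows that they are contained in the interval
$[1/\lambda_n, 1/\lambda_1]$. Thus $1/\lambda_n \le 1/\mu_n$ and $1/\lambda_1 \ge 1/\mu_1$. When inverted this yields $\mu_1 \ge \lambda_1$
and $\mu_n \le \lambda_n$, which shows that the eigenvalues of $\mathbf{R}^{1}_{k+1}$ are contained in $[\lambda_1, \lambda_n]$.
This establishes the result for $\phi = 1$.

For general $\phi$ the matrix $\mathbf{R}^{\phi}_{k+1}$ defined by (10.42) has eigenvalues that are all
monotonically increasing with $\phi$ (as can be seen from the interlocking eigenvalues
lemma). However, from above it is known that these eigenvalues are contained in
$[\lambda_1, \lambda_n]$ for $\phi = 0$ and $\phi = 1$. Hence, they must be contained in $[\lambda_1, \lambda_n]$ for all
$\phi$, $0 \le \phi \le 1$. ∎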


Scale  Factors 

In view of the result derived above, it is clearly advantageous to scale the matrix $\mathbf{H}_k$
so that the eigenvalues of $\mathbf{H}_k\mathbf{F}$ are spread both below and above unity. Of course
in the ideal case of a quadratic problem with perfect line search this is strictly only
necessary for $\mathbf{H}_0$, since unity is an eigenvalue of $\mathbf{H}_k\mathbf{F}$ for $k > 0$. But because of
the inescapable deviations from the ideal, it is useful to consider the possibility of
scaling every $\mathbf{H}_k$.

A scale factor can be incorporated directly into the updating formula. We first
multiply $\mathbf{H}_k$ by the scale factor $\gamma_k$ and then apply the usual updating formula. This
is equivalent to replacing $\mathbf{H}_k$ by $\gamma_k\mathbf{H}_k$ in (10.42) and leads to

$$\mathbf{H}_{k+1} = \left(\mathbf{H}_k - \frac{\mathbf{H}_k\mathbf{q}_k\mathbf{q}_k^T\mathbf{H}_k}{\mathbf{q}_k^T\mathbf{H}_k\mathbf{q}_k} + \phi_k\mathbf{v}_k\mathbf{v}_k^T\right)\gamma_k + \frac{\mathbf{p}_k\mathbf{p}_k^T}{\mathbf{p}_k^T\mathbf{q}_k}. \tag{10.45}$$

This defines a two-parameter family of updates that reduces to the Broyden family
for $\gamma_k = 1$.

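Using $\gamma_0, \gamma_1, \ldots$ as arbitrary positive scale factors, we consider the algorithm:
Start with any symmetric positive definite matrix $\mathbf{H}_0$ and any point $\mathbf{x}_0$, then starting
with $k = 0$,

Step 1. Set $\mathbf{d}_k = -\mathbf{H}_k\mathbf{g}_k$.

Step 2. Minimize $f(\mathbf{x}_k + \alpha\mathbf{d}_k)$ with respect to $\alpha \ge 0$ to obtain $\mathbf{x}_{k+1}$, $\mathbf{p}_k = \alpha_k\mathbf{d}_k$,
and $\mathbf{g}_{k+1}$.

Step 3. Set $\mathbf{q}_k = \mathbf{g}_{k+1} - \mathbf{g}_k$ and

$$\mathbf{H}_{k+1} = \left(\mathbf{H}_k - \frac{\mathbf{H}_k\mathbf{q}_k\mathbf{q}_k^T\mathbf{H}_k}{\mathbf{q}_k^T\mathbf{H}_k\mathbf{q}_k} + \phi_k\mathbf{v}_k\mathbf{v}_k^T\right)\gamma_k + \frac{\mathbf{p}_k\mathbf{p}_k^T}{\mathbf{p}_k^T\mathbf{q}_k},$$

$$\mathbf{v}_k = (\mathbf{q}_k^T\mathbf{H}_k\mathbf{q}_k)^{1/2}\left(\frac{\mathbf{p}_k}{\mathbf{p}_k^T\mathbf{q}_k} - \frac{\mathbf{H}_k\mathbf{q}_k}{\mathbf{q}_k^T\mathbf{H}_k\mathbf{q}_k}\right). \tag{10.46}$$

The use of scale factors does destroy the property $\mathbf{H}_n = \mathbf{F}^{-1}$ in the quadratic case,
but it does not destroy the conjugate direction property. The following properties of
this method can be proved as simple extensions of the results given in Sect. 10.3.

1. If $\mathbf{H}_k$ is positive definite and $\mathbf{p}_k^T\mathbf{q}_k > 0$, (10.46) yields an $\mathbf{H}_{k+1}$ that is positive
definite.

2. If $f$ is quadratic with Hessian $\mathbf{F}$, then the vectors $\mathbf{p}_0, \mathbf{p}_1, \ldots, \mathbf{p}_{n-1}$ are mutually
$\mathbf{F}$-orthogonal, and, for each $k$, the vectors $\mathbf{p}_0, \mathbf{p}_1, \ldots, \mathbf{p}_k$ are eigenvectors of
$\mathbf{H}_{k+1}\mathbf{F}$.

We can conclude that scale factors do not destroy the underlying conjugate behavior
of the algorithm. Hence we can use scaling to ensure good single-step convergence
properties.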


A  Self-scaling  Quasi-Newton  Algorithm 

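The question that arises next is how to select appropriate scale factors. If
$\lambda_1 \le \lambda_2 \le \ldots \le \lambda_n$ are the eigenvalues of $\mathbf{H}_k\mathbf{F}$, we want to multiply $\mathbf{H}_k$ by $\gamma_k$ where
$\lambda_1 \le 1/\gamma_k \le \lambda_n$. This will ensure that the new eigenvalues contain unity in the
interval they span.

Note that in terms of our earlier notation

$$\frac{\mathbf{q}_k^T\mathbf{H}_k\mathbf{q}_k}{\mathbf{p}_k^T\mathbf{q}_k} = \frac{\mathbf{r}_k^T\mathbf{R}_k\mathbf{r}_k}{\mathbf{r}_k^T\mathbf{r}_k}.$$

Recalling that $\mathbf{R}_k$ has the same eigenvalues as $\mathbf{H}_k\mathbf{F}$ and noting that for any $\mathbf{r}_k$

$$\lambda_1 \le \frac{\mathbf{r}_k^T\mathbf{R}_k\mathbf{r}_k}{\mathbf{r}_k^T\mathbf{r}_k} \le \lambda_n,$$

we see that

$$\gamma_k = \frac{\mathbf{p}_k^T\mathbf{q}_k}{\mathbf{q}_k^T\mathbf{H}_k\mathbf{q}_k} \tag{10.47}$$

serves as a suitable scale factor.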
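
For a concrete feel, the scale factor (10.47) can be evaluated at the first step of the six-dimensional example of Sect. 10.5. The sketch assumes $\mathbf{H}_0 = \mathbf{I}$ and the diagonal Hessian of that example; because the step length cancels in (10.47), the step is taken as the negative gradient itself. Variable names are illustrative.

```python
Q = [40.0, 38.0, 36.0, 34.0, 32.0, 30.0]   # eigenvalues of the example Hessian
x0 = [10.0] * 6

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

g0 = [qi * xi for qi, xi in zip(Q, x0)]
p = [-gi for gi in g0]                     # direction of the first step (H_0 = I)
q = [qi * pi for qi, pi in zip(Q, p)]      # q = Q p for a quadratic

gamma = dot(p, q) / dot(q, q)              # (10.47) with H = I
scaled = [gamma * qi for qi in Q]          # eigenvalues of (gamma * H_0) Q
print(1.0 / gamma)
print(min(scaled), max(scaled))
```

Here $1/\gamma_0$ is close to 36, inside $[30, 40]$, so the scaled eigenvalues straddle unity, which is the effect sought by the 1/35 scaling suggested earlier in this section.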

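We now state a complete self-scaling, restarting, quasi-Newton method based on
the ideas above. For simplicity we take $\phi = 0$ and thus obtain a modification of the
DFP method. Start at any point $\mathbf{x}_0$, $k = 0$.

Step 1. Set $\mathbf{H}_k = \mathbf{I}$.

Step 2. Set $\mathbf{d}_k = -\mathbf{H}_k\mathbf{g}_k$.

Step 3. Minimize $f(\mathbf{x}_k + \alpha\mathbf{d}_k)$ with respect to $\alpha \ge 0$ to obtain $\alpha_k$, $\mathbf{x}_{k+1}$,
$\mathbf{p}_k = \alpha_k\mathbf{d}_k$, $\mathbf{g}_{k+1}$ and $\mathbf{q}_k = \mathbf{g}_{k+1} - \mathbf{g}_k$. (Select $\alpha_k$ accurately enough to ensure
$\mathbf{p}_k^T\mathbf{q}_k > 0$.)

Step 4. If $k$ is not an integer multiple of $n$, set

$$\mathbf{H}_{k+1} = \left(\mathbf{H}_k - \frac{\mathbf{H}_k\mathbf{q}_k\mathbf{q}_k^T\mathbf{H}_k}{\mathbf{q}_k^T\mathbf{H}_k\mathbf{q}_k}\right)\frac{\mathbf{p}_k^T\mathbf{q}_k}{\mathbf{q}_k^T\mathbf{H}_k\mathbf{q}_k} + \frac{\mathbf{p}_k\mathbf{p}_k^T}{\mathbf{p}_k^T\mathbf{q}_k}. \tag{10.48}$$

Add one to $k$ and return to Step 2. If $k$ is an integer multiple of $n$, return to Step 1.

This algorithm was run, with various amounts of inaccuracy introduced in the line
search, on the quadratic problem presented in Sect. 10.5. The results are presented
in that section.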
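
The steps above can be sketched in a few lines for the quadratic example. This is a hedged illustration, not the program used for the reported runs: it assumes the diagonal Hessian of the example, computes the optimal step in closed form and inflates it by a relative error `err`, and restarts whenever the iteration count is a multiple of $n$ (counting from zero). All function and variable names are invented for the sketch.

```python
Q = [40.0, 38.0, 36.0, 34.0, 32.0, 30.0]   # diagonal of the example Hessian
N = len(Q)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def f(x):
    return 0.5 * sum(qi * xi * xi for qi, xi in zip(Q, x))

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def self_scaling_dfp(x, err, iters, scale=True):
    """Return f after each of `iters` steps; `err` is the relative step error."""
    x = list(x)
    values = []
    H = identity(N)
    for k in range(iters):
        if k % N == 0:
            H = identity(N)                          # Step 1: restart with H = I
        g = [qi * xi for qi, xi in zip(Q, x)]
        d = [-dot(row, g) for row in H]              # Step 2: d = -H g
        dQd = sum(qi * di * di for qi, di in zip(Q, d))
        if dQd == 0.0:
            break                                    # already at the minimum
        alpha = (1.0 + err) * (-dot(g, d) / dQd)     # Step 3: perturbed exact step
        p = [alpha * di for di in d]
        x = [xi + pi for xi, pi in zip(x, p)]
        q = [qi * pi for qi, pi in zip(Q, p)]        # q = g_{k+1} - g_k = Q p
        Hq = [dot(row, q) for row in H]
        pq, qHq = dot(p, q), dot(q, Hq)
        gamma = pq / qHq if scale else 1.0           # scale factor (10.47)
        H = [[gamma * (H[i][j] - Hq[i] * Hq[j] / qHq) + p[i] * p[j] / pq
              for j in range(N)] for i in range(N)]  # Step 4: update (10.48)
        values.append(f(x))
    return values

x0 = [10.0] * N
exact = self_scaling_dfp(x0, 0.0, 6)     # no step-size error
rough = self_scaling_dfp(x0, 0.10, 5)    # 10 % step-size error
print(exact)
print(rough)
```

Calling `self_scaling_dfp(x0, err, iters, scale=False)` gives a restarted DFP iteration for comparison. With no step error the function value reaches zero, to rounding, after six steps; with errors the printed values can be compared against the self-scaling columns of the tables in Sect. 10.5.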


10.7  Memoryless  Quasi-Newton  Methods 

The  preceding  development  of  quasi-Newton  methods  can  be  used  as  a  basis  for 
reconsideration  of  conjugate  gradient  methods.  The  result  is  an  attractive  class  of 
new  procedures. 

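Consider a simplification of the BFGS quasi-Newton method where $\mathbf{H}_{k+1}$ is defined
by a BFGS update applied to $\mathbf{H} = \mathbf{I}$, rather than to $\mathbf{H}_k$. Thus $\mathbf{H}_{k+1}$ is determined
without reference to the previous $\mathbf{H}_k$, and hence the update procedure is memoryless.
This update procedure leads to the following algorithm: Start at any point $\mathbf{x}_0$, $k = 0$.

Step 1. Set

$$\mathbf{H}_k = \mathbf{I}. \tag{10.49}$$

Step 2. Set

$$\mathbf{d}_k = -\mathbf{H}_k\mathbf{g}_k. \tag{10.50}$$

Step 3. Minimize $f(\mathbf{x}_k + \alpha\mathbf{d}_k)$ with respect to $\alpha \ge 0$ to obtain $\alpha_k$, $\mathbf{x}_{k+1}$,
$\mathbf{p}_k = \alpha_k\mathbf{d}_k$, $\mathbf{g}_{k+1}$, and $\mathbf{q}_k = \mathbf{g}_{k+1} - \mathbf{g}_k$. (Select $\alpha_k$ accurately enough to ensure
$\mathbf{p}_k^T\mathbf{q}_k > 0$.)

Step 4. If $k$ is not an integer multiple of $n$, set

$$\mathbf{H}_{k+1} = \mathbf{I} - \frac{\mathbf{q}_k\mathbf{p}_k^T + \mathbf{p}_k\mathbf{q}_k^T}{\mathbf{p}_k^T\mathbf{q}_k} + \left(1 + \frac{\mathbf{q}_k^T\mathbf{q}_k}{\mathbf{p}_k^T\mathbf{q}_k}\right)\frac{\mathbf{p}_k\mathbf{p}_k^T}{\mathbf{p}_k^T\mathbf{q}_k}. \tag{10.51}$$

Add 1 to $k$ and return to Step 2. If $k$ is an integer multiple of $n$, return to Step 1.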
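Combining (10.50) and (10.51), it is easily seen that

$$\mathbf{d}_{k+1} = -\mathbf{g}_{k+1} + \frac{\mathbf{q}_k\mathbf{p}_k^T\mathbf{g}_{k+1} + \mathbf{p}_k\mathbf{q}_k^T\mathbf{g}_{k+1}}{\mathbf{p}_k^T\mathbf{q}_k} - \left(1 + \frac{\mathbf{q}_k^T\mathbf{q}_k}{\mathbf{p}_k^T\mathbf{q}_k}\right)\frac{\mathbf{p}_k\mathbf{p}_k^T\mathbf{g}_{k+1}}{\mathbf{p}_k^T\mathbf{q}_k}. \tag{10.52}$$

If the line search is exact, then $\mathbf{p}_k^T\mathbf{g}_{k+1} = 0$ and hence $\mathbf{p}_k^T\mathbf{q}_k = -\mathbf{p}_k^T\mathbf{g}_k$. In this case
(10.52) is equivalent to

$$\mathbf{d}_{k+1} = -\mathbf{g}_{k+1} + \frac{\mathbf{q}_k^T\mathbf{g}_{k+1}}{\mathbf{p}_k^T\mathbf{q}_k}\mathbf{p}_k = -\mathbf{g}_{k+1} + \beta_k\mathbf{d}_k, \tag{10.53}$$

where

$$\beta_k = \frac{\mathbf{q}_k^T\mathbf{g}_{k+1}}{\mathbf{g}_k^T\mathbf{g}_k}.$$

This coincides exactly with the Polak-Ribiere form of the conjugate gradient method.
Thus use of the BFGS update in this way yields an algorithm that is of the modified
Newton type with positive definite coefficient matrix and which is equivalent to a
standard implementation of the conjugate gradient method when the line search is
exact.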
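
The agreement between (10.52) and the Polak-Ribiere direction under exact line search can be observed numerically. The sketch assumes a quadratic with the diagonal Hessian of the example in Sect. 10.5, so that the exact step is available in closed form; function names are illustrative and not from the text.

```python
Q = [40.0, 38.0, 36.0, 34.0, 32.0, 30.0]   # diagonal Hessian, for illustration

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def grad(x):
    return [qi * xi for qi, xi in zip(Q, x)]

def exact_step(x, d):
    """Step p = alpha d with alpha minimizing the quadratic along d."""
    alpha = -dot(grad(x), d) / sum(qi * di * di for qi, di in zip(Q, d))
    return [alpha * di for di in d]

def memoryless_bfgs_direction(g_new, p, q):
    """Direction (10.52): -H g_new with H the BFGS update of the identity."""
    pq, pg, qg = dot(p, q), dot(p, g_new), dot(q, g_new)
    last = (1.0 + dot(q, q) / pq) * pg / pq
    return [-gi + (qi * pg + pi * qg) / pq - last * pi
            for gi, pi, qi in zip(g_new, p, q)]

def polak_ribiere_direction(g_new, g_old, d_old):
    """Direction -g_new + beta d_old with the Polak-Ribiere beta."""
    diff = [a - b for a, b in zip(g_new, g_old)]
    beta = dot(diff, g_new) / dot(g_old, g_old)
    return [-gi + beta * di for gi, di in zip(g_new, d_old)]

x = [10.0] * 6
g = grad(x)
d = [-gi for gi in g]                      # first direction: steepest descent
gaps = []
for _ in range(3):
    p = exact_step(x, d)
    x = [xi + pi for xi, pi in zip(x, p)]
    g_new = grad(x)
    q = [a - b for a, b in zip(g_new, g)]
    d_qn = memoryless_bfgs_direction(g_new, p, q)
    d_cg = polak_ribiere_direction(g_new, g, d)
    gaps.append(max(abs(a - b) for a, b in zip(d_qn, d_cg)))
    g, d = g_new, d_qn
print(gaps)
```

The gaps are at rounding level. If the step in `exact_step` is perturbed, the two directions separate, and (10.52) is then the form suggested by the quasi-Newton viewpoint.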

The  algorithm  can  be  used  without  exact  line  search  in  a  form  that  is  similar 
to  that  of  the  conjugate  gradient  method  by  using  (10.52).  This  requires  storage  of 
only  the  same  vectors  that  are  required  of  the  conjugate  gradient  method.  In  light 
of  the  theory  of  quasi-Newton  methods,  however,  the  new  form  can  be  expected 
to  be  superior  when  inexact  line  searches  are  employed,  and  indeed  experiments 
confirm  this. 
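
The equivalence between (10.52) and (10.53) under exact line search can be checked numerically. The following is only a minimal sketch: the quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} - \mathbf{b}^T\mathbf{x}$, the particular `Q`, `b`, and starting point, and all function names are illustrative assumptions rather than anything prescribed by the text.

```python
# Sketch: compare the memoryless BFGS direction (10.52) with the
# Polak-Ribiere direction (10.53) after one exact line search step.
# Q, b and the starting point are assumed test data.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def axpy(a, x, y):
    """Return a*x + y for a scalar a and vectors x, y."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def memoryless_bfgs_direction(g_new, p, q):
    """d_{k+1} from (10.52); requires p^T q > 0."""
    pq = dot(p, q)
    pg = dot(p, g_new)
    qg = dot(q, g_new)
    coef_q = pg / pq                                   # multiplies q_k
    coef_p = qg / pq - (1.0 + dot(q, q) / pq) * pg / pq  # multiplies p_k
    d = [-gi for gi in g_new]
    d = axpy(coef_q, q, d)
    return axpy(coef_p, p, d)

def polak_ribiere_direction(g_new, g_old, d_old):
    """d_{k+1} from (10.53) with beta_k = q_k^T g_{k+1} / g_k^T g_k."""
    q = axpy(-1.0, g_old, g_new)
    beta = dot(q, g_new) / dot(g_old, g_old)
    return axpy(beta, d_old, [-gi for gi in g_new])

Q = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]  # symmetric positive definite
b = [1.0, 2.0, 3.0]
x = [0.0, 0.0, 0.0]

g = axpy(-1.0, b, matvec(Q, x))              # g_k = Q x_k - b
d = [-gi for gi in g]                        # Steps 1 and 2 with H_k = I
alpha = -dot(g, d) / dot(d, matvec(Q, d))    # exact line search on a quadratic
p = [alpha * di for di in d]
x_new = axpy(1.0, p, x)
g_new = axpy(-1.0, b, matvec(Q, x_new))
q = axpy(-1.0, g, g_new)

d_bfgs = memoryless_bfgs_direction(g_new, p, q)
d_pr = polak_ribiere_direction(g_new, g, d)
max_gap = max(abs(u - v) for u, v in zip(d_bfgs, d_pr))
print(max_gap)  # agreement up to rounding error
```

With an inexact `alpha` the two directions differ, and it is the form (10.52) that retains the quasi-Newton interpretation.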

The above idea can be easily extended to produce a memoryless quasi-Newton
method corresponding to any member of the Broyden family. The update formula
(10.51) would simply use the general Broyden update (10.41) with $\mathbf{H}_k$ set equal to
$\mathbf{I}$. In the case of exact line search (with $\mathbf{p}_k^T \mathbf{g}_{k+1} = 0$), the
resulting formula for $\mathbf{d}_{k+1}$ reduces to

$$\mathbf{d}_{k+1} = -\mathbf{g}_{k+1} + (1 - \phi)\, \frac{\mathbf{g}_{k+1}^T \mathbf{q}_k}{\mathbf{q}_k^T \mathbf{q}_k}\, \mathbf{q}_k + \phi\, \frac{\mathbf{q}_k^T \mathbf{g}_{k+1}}{\mathbf{p}_k^T \mathbf{q}_k}\, \mathbf{p}_k. \qquad (10.54)$$

We note that (10.54) is equivalent to the conjugate gradient direction (10.53) only
for $\phi = 1$, corresponding to the BFGS update. For this reason the choice $\phi = 1$ is
generally preferred for this type of method.
Scaling and Preconditioning


Since  the  conjugate  gradient  method  implemented  as  a  memoryless  quasi-Newton 
method  is  a  modified  Newton  method,  the  fundamental  convergence  theory  based 
on  condition  number  emphasized  throughout  this  part  of  the  book  is  applicable,  as 
are  the  procedures  for  improving  convergence.  It  is  clear  that  the  function  scaling 
procedures  discussed  in  the  previous  section  can  be  incorporated. 

According to the general theory of modified Newton methods, it is the eigenvalues
of $\mathbf{H}_k \mathbf{F}(\mathbf{x}_k)$ that influence the convergence properties of these
algorithms. From the analysis of the last section, the memoryless BFGS update procedure will,
in the pure quadratic case, yield a matrix $\mathbf{H}_k \mathbf{F}$ that has a more favorable
eigenvalue ratio than $\mathbf{F}$ itself only if the function $f$ is scaled so that unity is
contained in the interval spanned by the eigenvalues of $\mathbf{F}$. Experimental evidence
verifies that at least an initial scaling of the function in this way can lead to significant
improvement. Scaling can be introduced at every step as well, and complete self-scaling can be
effective in some situations.

It is possible to extend the scaling procedure to a more general preconditioning
procedure. In this procedure the matrix governing convergence is changed from
$\mathbf{F}(\mathbf{x}_k)$ to $\mathbf{H}\mathbf{F}(\mathbf{x}_k)$ for some $\mathbf{H}$. If
$\mathbf{H}\mathbf{F}(\mathbf{x}_k)$ has its eigenvalues all close to unity, then
the memoryless quasi-Newton method can be expected to perform exceedingly
well, since it possesses simultaneously the advantages of being a conjugate gradient
method and being a well-conditioned modified Newton method.

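Preconditioning can be conveniently expressed in the basic algorithm by applying the
BFGS update to $\mathbf{H}$ instead of $\mathbf{I}$, and by replacing $\mathbf{I}$ by
$\mathbf{H}$ in Step 1. Thus (10.51) becomes

$$\mathbf{H}_{k+1} = \mathbf{H} - \frac{\mathbf{H}\mathbf{q}_k \mathbf{p}_k^T + \mathbf{p}_k \mathbf{q}_k^T \mathbf{H}}{\mathbf{p}_k^T \mathbf{q}_k} + \left(1 + \frac{\mathbf{q}_k^T \mathbf{H} \mathbf{q}_k}{\mathbf{p}_k^T \mathbf{q}_k}\right) \frac{\mathbf{p}_k \mathbf{p}_k^T}{\mathbf{p}_k^T \mathbf{q}_k}, \qquad (10.55)$$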


and  the  explicit  conjugate  gradient  version  (10.52)  is  also  modified  accordingly. 

Preconditioning can also be used in conjunction with an $(m + 1)$-cycle partial
conjugate gradient version of the memoryless quasi-Newton method. This is highly
effective if a simple $\mathbf{H}$ can be found (as it sometimes can in problems with structure)
so that the eigenvalues of $\mathbf{H}\mathbf{F}(\mathbf{x}_k)$ are such that either all but $m$
are equal to unity or they are in $m$ bunches. For large-scale problems, methods of this type
seem to be quite promising.
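
A quick way to gain confidence in the preconditioned update (10.55) is to check that it satisfies the quasi-Newton (secant) condition $\mathbf{H}_{k+1}\mathbf{q}_k = \mathbf{p}_k$ and remains symmetric. The sketch below does this for one step; the diagonal preconditioner `H`, the matrix `Q`, the step `p`, and the function names are assumptions chosen only for illustration.

```python
# Sketch: form H_{k+1} from (10.55) and verify H_{k+1} q_k = p_k.
# H is an assumed diagonal preconditioner (inverse of the diagonal of Q).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def preconditioned_update(H, p, q):
    """H_{k+1} from (10.55); H symmetric positive definite, p^T q > 0."""
    n = len(p)
    pq = dot(p, q)
    Hq = matvec(H, q)                 # H symmetric, so q^T H = (H q)^T
    scale = (1.0 + dot(q, Hq) / pq) / pq
    return [[H[i][j] - (Hq[i] * p[j] + p[i] * Hq[j]) / pq + scale * p[i] * p[j]
             for j in range(n)]
            for i in range(n)]

Q = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
H = [[0.25, 0.0, 0.0], [0.0, 1.0 / 3.0, 0.0], [0.0, 0.0, 0.5]]
p = [0.28, 0.56, 0.84]               # an assumed step p_k
q = matvec(Q, p)                     # for a quadratic, q_k = Q p_k

H_new = preconditioned_update(H, p, q)
secant_gap = max(abs(a - b) for a, b in zip(matvec(H_new, q), p))
asymmetry = max(abs(H_new[i][j] - H_new[j][i]) for i in range(3) for j in range(3))
print(secant_gap, asymmetry)
```

Setting `H` to the identity recovers the unpreconditioned update (10.51).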


10.8  *Combination of Steepest Descent and Newton’s Method

In this section we digress from the study of quasi-Newton methods, and again
expand our collection of basic principles. We present a combination of steepest descent
and Newton’s method which includes them both as special cases. The resulting
combined method can be used to develop algorithms for problems having special
structure, as illustrated in Chap. 13. This method and its analysis constitute a
fundamental element of the modern theory of algorithms.

The method itself is quite simple. Suppose there is a subspace $N$ of $E^n$ on which
the inverse Hessian of the objective function $f$ is known (we shall make this statement
more precise later). Then, in the quadratic case, the minimum of $f$ over any
linear variety parallel to $N$ (that is, any translation of $N$) can be found in a single
step. To minimize $f$ over the whole space starting at any point $\mathbf{x}_k$, we could
minimize $f$ over the linear variety parallel to $N$ and containing $\mathbf{x}_k$ to obtain
$\mathbf{z}_k$, and then take a steepest descent step from there. This procedure is illustrated
in Fig. 10.1. Since $\mathbf{z}_k$ is the minimum point of $f$ over a linear variety parallel to
$N$, the gradient at $\mathbf{z}_k$ will be orthogonal to $N$, and hence the gradient step is
orthogonal to $N$. If $f$ is not quadratic we can, knowing the Hessian of $f$ on $N$,
approximate the minimum point of $f$ over a linear variety parallel to $N$ by one step of
Newton’s method. To implement this scheme, which we have described in a geometric sense, it is
necessary to agree on a method for defining the subspace $N$ and to determine what information
about the inverse Hessian is required so as to implement a Newton step over $N$. We
now turn to these questions.

Often, the most convenient way to describe a subspace, and the one we follow
in this development, is in terms of a set of vectors that generate it. Thus, if $\mathbf{B}$ is an
$n \times m$ matrix consisting of $m$ column vectors that generate $N$, we may write $N$ as the
set of all vectors of the form $\mathbf{B}\mathbf{u}$ where $\mathbf{u} \in E^m$. For simplicity we
always assume that the columns of $\mathbf{B}$ are linearly independent.

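To see what information about the inverse Hessian is required, imagine that we
are at a point $\mathbf{x}_k$ and wish to find the approximate minimum point $\mathbf{z}_k$ of $f$
with respect to movement in $N$. Thus, we seek $\mathbf{u}_k$ so that

$$\mathbf{z}_k = \mathbf{x}_k + \mathbf{B}\mathbf{u}_k$$

approximately minimizes $f$. By “approximately minimizes” we mean that $\mathbf{z}_k$ should
be the Newton approximation to the minimum over this subspace. We write

$$f(\mathbf{z}_k) \simeq f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)\mathbf{B}\mathbf{u}_k + \tfrac{1}{2}\mathbf{u}_k^T \mathbf{B}^T \mathbf{F}(\mathbf{x}_k) \mathbf{B} \mathbf{u}_k$$

and solve for $\mathbf{u}_k$ to obtain the Newton approximation. We find

$$\mathbf{u}_k = -(\mathbf{B}^T \mathbf{F}(\mathbf{x}_k) \mathbf{B})^{-1} \mathbf{B}^T \nabla f(\mathbf{x}_k)^T$$
$$\mathbf{z}_k = \mathbf{x}_k - \mathbf{B}(\mathbf{B}^T \mathbf{F}(\mathbf{x}_k) \mathbf{B})^{-1} \mathbf{B}^T \nabla f(\mathbf{x}_k)^T.$$

We see by analogy with the formula for Newton’s method that the expression
$\mathbf{B}(\mathbf{B}^T \mathbf{F}(\mathbf{x}_k) \mathbf{B})^{-1} \mathbf{B}^T$ can be interpreted as
the inverse of $\mathbf{F}(\mathbf{x}_k)$ restricted to the subspace $N$.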
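
The Newton step over $N$ derived above can be sketched in a few lines. For a quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} - \mathbf{b}^T\mathbf{x}$ the step is exact, so the gradient at $\mathbf{z}_k$ should be orthogonal to every column of $\mathbf{B}$. The data (`Q`, `b`, `B`, the starting point) and the small Gaussian elimination helper are assumptions made for this illustration.

```python
# Sketch: z = x - B (B^T F B)^{-1} B^T grad, checked on a quadratic where F = Q.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, C):
    Ct = transpose(C)
    return [[dot(row, col) for col in Ct] for row in A]

def solve(A, rhs):
    """Solve A y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][j] * y[j] for j in range(r + 1, n))
        y[r] = (M[r][n] - s) / M[r][r]
    return y

def subspace_newton_step(F, B, grad, x):
    """Return z = x + B u with u = -(B^T F B)^{-1} B^T grad."""
    Bt = transpose(B)
    BtFB = matmul(matmul(Bt, F), B)
    u = solve(BtFB, [-c for c in matvec(Bt, grad)])
    return [xi + si for xi, si in zip(x, matvec(B, u))]

Q = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
B = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]   # n = 3, m = 2, independent columns
x = [1.0, -1.0, 2.0]

grad = [qi - bi for qi, bi in zip(matvec(Q, x), b)]
z = subspace_newton_step(Q, B, grad, x)
grad_z = [qi - bi for qi, bi in zip(matvec(Q, z), b)]
residual = max(abs(c) for c in matvec(transpose(B), grad_z))
print(residual)  # B^T grad f(z) vanishes up to rounding
```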
Fig. 10.1  Combined method


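Example. Suppose

$$\mathbf{B} = \begin{bmatrix} \mathbf{I} \\ \mathbf{0} \end{bmatrix},$$

where $\mathbf{I}$ is an $m \times m$ identity matrix. This corresponds to the case where $N$ is
the subspace generated by the first $m$ unit basis elements of $E^n$. Let us partition
$\mathbf{F} = \nabla^2 f(\mathbf{x}_k)$ as

$$\mathbf{F} = \begin{bmatrix} \mathbf{F}_{11} & \mathbf{F}_{12} \\ \mathbf{F}_{21} & \mathbf{F}_{22} \end{bmatrix},$$

where $\mathbf{F}_{11}$ is $m \times m$. Then, in this case

$$(\mathbf{B}^T \mathbf{F} \mathbf{B})^{-1} = \mathbf{F}_{11}^{-1},$$

and

$$\mathbf{B}(\mathbf{B}^T \mathbf{F} \mathbf{B})^{-1} \mathbf{B}^T = \begin{bmatrix} \mathbf{F}_{11}^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix},$$

which shows explicitly that it is the inverse of $\mathbf{F}$ on $N$ that is required. The general
case can be regarded as being obtained through partitioning in some skew coordinate
system.

Now that the Newton approximation over $N$ has been derived, it is possible to
formalize the details of the algorithm suggested by Fig. 10.1. At a given point $\mathbf{x}_k$,
the point $\mathbf{x}_{k+1}$ is determined through

a) Set $\mathbf{d}_k = -\mathbf{B}(\mathbf{B}^T \mathbf{F}(\mathbf{x}_k) \mathbf{B})^{-1} \mathbf{B}^T \nabla f(\mathbf{x}_k)^T$.

b) $\mathbf{z}_k = \mathbf{x}_k + \beta_k \mathbf{d}_k$, where $\beta_k$ minimizes $f(\mathbf{x}_k + \beta \mathbf{d}_k)$.  (10.56)

c) Set $\mathbf{p}_k = -\nabla f(\mathbf{z}_k)^T$.

d) $\mathbf{x}_{k+1} = \mathbf{z}_k + \alpha_k \mathbf{p}_k$, where $\alpha_k$ minimizes $f(\mathbf{z}_k + \alpha \mathbf{p}_k)$.

The scalar search parameter $\beta_k$ is introduced in the Newton part of the algorithm
simply to assure that the descent conditions required for global convergence are
met. Normally $\beta_k$ will be approximately equal to unity. (See Sect. 8.5.)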
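
The four steps of (10.56) can be sketched for the quadratic case, where $\beta_k = 1$ and $\alpha_k$ has a closed form. To keep the example short, the sketch assumes that $N$ is generated by a single vector, so that $\mathbf{B}^T\mathbf{Q}\mathbf{B}$ is a scalar; `Q`, `b`, `n_dir`, and the function names are illustrative assumptions.

```python
# Sketch of the combined method (10.56) for f(x) = 0.5 x^T Q x - b^T x,
# with the subspace N spanned by one assumed direction n_dir.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def combined_step(Q, b, n_dir, x):
    # a), b): Newton step over N (exact for a quadratic, so beta_k = 1)
    g = [qi - bi for qi, bi in zip(matvec(Q, x), b)]
    u = -dot(n_dir, g) / dot(n_dir, matvec(Q, n_dir))
    z = [xi + u * ni for xi, ni in zip(x, n_dir)]
    # c): negative gradient at z
    p = [bi - qi for bi, qi in zip(b, matvec(Q, z))]
    pp = dot(p, p)
    if pp == 0.0:          # z is already the minimizer
        return z
    # d): exact steepest descent step from z
    alpha = pp / dot(p, matvec(Q, p))
    return [zi + alpha * pi for zi, pi in zip(z, p)]

Q = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
n_dir = [1.0, 0.0, 0.0]
x = [0.0, 0.0, 0.0]

for _ in range(60):
    x = combined_step(Q, b, n_dir, x)

g = [qi - bi for qi, bi in zip(matvec(Q, x), b)]
grad_norm = dot(g, g) ** 0.5
print(x, grad_norm)
```

For a nonquadratic $f$ the two closed-form steps would be replaced by line searches for $\beta_k$ and $\alpha_k$, as in (10.56).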
Analysis of Quadratic Case

Since the method is not a full Newton method, we can conclude that it possesses
only linear convergence and that the dominating aspects of convergence will be
revealed by an analysis of the method as applied to a quadratic function. Furthermore,
as might be intuitively anticipated, the associated rate of convergence is governed
by the steepest descent part of algorithm (10.56), and that rate is governed by a
Kantorovich-like ratio defined over the subspace orthogonal to $N$.

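Theorem (Combined Method). Let $\mathbf{Q}$ be an $n \times n$ symmetric positive definite matrix,
and let $\mathbf{x}^* \in E^n$. Define the function

$$E(\mathbf{x}) = \tfrac{1}{2}(\mathbf{x} - \mathbf{x}^*)^T \mathbf{Q} (\mathbf{x} - \mathbf{x}^*)$$

and let $\mathbf{b} = \mathbf{Q}\mathbf{x}^*$. Let $\mathbf{B}$ be an $n \times m$ matrix of rank $m$.
Starting at an arbitrary point $\mathbf{x}_0$, define the iterative process

a) $\mathbf{u}_k = -(\mathbf{B}^T \mathbf{Q} \mathbf{B})^{-1} \mathbf{B}^T \mathbf{g}_k$, where $\mathbf{g}_k = \mathbf{Q}\mathbf{x}_k - \mathbf{b}$.

b) $\mathbf{z}_k = \mathbf{x}_k + \mathbf{B}\mathbf{u}_k$.

c) $\mathbf{p}_k = \mathbf{b} - \mathbf{Q}\mathbf{z}_k$.

d) $\mathbf{x}_{k+1} = \mathbf{z}_k + \alpha_k \mathbf{p}_k$, where $\alpha_k = \dfrac{\mathbf{p}_k^T \mathbf{p}_k}{\mathbf{p}_k^T \mathbf{Q} \mathbf{p}_k}$.

This process converges to $\mathbf{x}^*$, and satisfies

$$E(\mathbf{x}_{k+1}) \le (1 - \delta) E(\mathbf{x}_k) \qquad (10.57)$$

where $\delta$, $0 \le \delta \le 1$, is the minimum of

$$\frac{(\mathbf{p}^T \mathbf{p})^2}{(\mathbf{p}^T \mathbf{Q} \mathbf{p})(\mathbf{p}^T \mathbf{Q}^{-1} \mathbf{p})}$$

over all vectors $\mathbf{p}$ in the nullspace of $\mathbf{B}^T$.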

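Proof. The algorithm given in the theorem statement is exactly the general combined
algorithm specialized to the quadratic situation. Next we note that

$$\begin{aligned}
\mathbf{B}^T \mathbf{p}_k &= \mathbf{B}^T \mathbf{Q}(\mathbf{x}^* - \mathbf{z}_k) = \mathbf{B}^T \mathbf{Q}(\mathbf{x}^* - \mathbf{x}_k) - \mathbf{B}^T \mathbf{Q} \mathbf{B} \mathbf{u}_k \\
&= -\mathbf{B}^T \mathbf{g}_k + \mathbf{B}^T \mathbf{Q} \mathbf{B} (\mathbf{B}^T \mathbf{Q} \mathbf{B})^{-1} \mathbf{B}^T \mathbf{g}_k = \mathbf{0},
\end{aligned} \qquad (10.58)$$

which merely proves that the gradient at $\mathbf{z}_k$ is orthogonal to $N$. Next we calculate

$$\begin{aligned}
2\{E(\mathbf{x}_k) - E(\mathbf{z}_k)\} &= (\mathbf{x}_k - \mathbf{x}^*)^T \mathbf{Q} (\mathbf{x}_k - \mathbf{x}^*) - (\mathbf{z}_k - \mathbf{x}^*)^T \mathbf{Q} (\mathbf{z}_k - \mathbf{x}^*) \\
&= -2\mathbf{u}_k^T \mathbf{B}^T \mathbf{Q} (\mathbf{x}_k - \mathbf{x}^*) - \mathbf{u}_k^T \mathbf{B}^T \mathbf{Q} \mathbf{B} \mathbf{u}_k \\
&= -2\mathbf{u}_k^T \mathbf{B}^T \mathbf{g}_k + \mathbf{u}_k^T \mathbf{B}^T \mathbf{Q} \mathbf{B} (\mathbf{B}^T \mathbf{Q} \mathbf{B})^{-1} \mathbf{B}^T \mathbf{g}_k \\
&= -\mathbf{u}_k^T \mathbf{B}^T \mathbf{g}_k = \mathbf{g}_k^T \mathbf{B} (\mathbf{B}^T \mathbf{Q} \mathbf{B})^{-1} \mathbf{B}^T \mathbf{g}_k.
\end{aligned} \qquad (10.59)$$

Then we compute

$$\begin{aligned}
2\{E(\mathbf{z}_k) - E(\mathbf{x}_{k+1})\} &= (\mathbf{z}_k - \mathbf{x}^*)^T \mathbf{Q} (\mathbf{z}_k - \mathbf{x}^*) - (\mathbf{x}_{k+1} - \mathbf{x}^*)^T \mathbf{Q} (\mathbf{x}_{k+1} - \mathbf{x}^*) \\
&= -2\alpha_k \mathbf{p}_k^T \mathbf{Q} (\mathbf{z}_k - \mathbf{x}^*) - \alpha_k^2 \mathbf{p}_k^T \mathbf{Q} \mathbf{p}_k \\
&= 2\alpha_k \mathbf{p}_k^T \mathbf{p}_k - \alpha_k^2 \mathbf{p}_k^T \mathbf{Q} \mathbf{p}_k \\
&= \alpha_k \mathbf{p}_k^T \mathbf{p}_k = \frac{(\mathbf{p}_k^T \mathbf{p}_k)^2}{\mathbf{p}_k^T \mathbf{Q} \mathbf{p}_k}.
\end{aligned} \qquad (10.60)$$

Now using (10.58) and $\mathbf{p}_k = -\mathbf{g}_k - \mathbf{Q}\mathbf{B}\mathbf{u}_k$ we have

$$\begin{aligned}
2E(\mathbf{x}_k) &= (\mathbf{x}_k - \mathbf{x}^*)^T \mathbf{Q} (\mathbf{x}_k - \mathbf{x}^*) = \mathbf{g}_k^T \mathbf{Q}^{-1} \mathbf{g}_k \\
&= (\mathbf{p}_k^T + \mathbf{u}_k^T \mathbf{B}^T \mathbf{Q}) \mathbf{Q}^{-1} (\mathbf{p}_k + \mathbf{Q} \mathbf{B} \mathbf{u}_k) \\
&= \mathbf{p}_k^T \mathbf{Q}^{-1} \mathbf{p}_k + \mathbf{u}_k^T \mathbf{B}^T \mathbf{Q} \mathbf{B} \mathbf{u}_k \\
&= \mathbf{p}_k^T \mathbf{Q}^{-1} \mathbf{p}_k + \mathbf{g}_k^T \mathbf{B} (\mathbf{B}^T \mathbf{Q} \mathbf{B})^{-1} \mathbf{B}^T \mathbf{g}_k.
\end{aligned} \qquad (10.61)$$

Adding (10.59) and (10.60) and dividing by (10.61) there results

$$\begin{aligned}
\frac{E(\mathbf{x}_k) - E(\mathbf{x}_{k+1})}{E(\mathbf{x}_k)} &= \frac{\mathbf{g}_k^T \mathbf{B} (\mathbf{B}^T \mathbf{Q} \mathbf{B})^{-1} \mathbf{B}^T \mathbf{g}_k + (\mathbf{p}_k^T \mathbf{p}_k)^2 / \mathbf{p}_k^T \mathbf{Q} \mathbf{p}_k}{\mathbf{p}_k^T \mathbf{Q}^{-1} \mathbf{p}_k + \mathbf{g}_k^T \mathbf{B} (\mathbf{B}^T \mathbf{Q} \mathbf{B})^{-1} \mathbf{B}^T \mathbf{g}_k} \\
&= \frac{q + (\mathbf{p}_k^T \mathbf{p}_k)/(\mathbf{p}_k^T \mathbf{Q} \mathbf{p}_k)}{q + (\mathbf{p}_k^T \mathbf{Q}^{-1} \mathbf{p}_k)/(\mathbf{p}_k^T \mathbf{p}_k)},
\end{aligned}$$

where $q = \mathbf{g}_k^T \mathbf{B} (\mathbf{B}^T \mathbf{Q} \mathbf{B})^{-1} \mathbf{B}^T \mathbf{g}_k / \mathbf{p}_k^T \mathbf{p}_k \ge 0$.
This has the form $(q + a)/(q + b)$ with

$$a = \frac{\mathbf{p}_k^T \mathbf{p}_k}{\mathbf{p}_k^T \mathbf{Q} \mathbf{p}_k}, \qquad b = \frac{\mathbf{p}_k^T \mathbf{Q}^{-1} \mathbf{p}_k}{\mathbf{p}_k^T \mathbf{p}_k}.$$

But for any $\mathbf{p}_k$, it follows that $a \le b$. Hence

$$\frac{q + a}{q + b} \ge \frac{a}{b},$$

and thus

$$\frac{E(\mathbf{x}_k) - E(\mathbf{x}_{k+1})}{E(\mathbf{x}_k)} \ge \frac{(\mathbf{p}_k^T \mathbf{p}_k)^2}{(\mathbf{p}_k^T \mathbf{Q} \mathbf{p}_k)(\mathbf{p}_k^T \mathbf{Q}^{-1} \mathbf{p}_k)}.$$

Finally,

$$E(\mathbf{x}_{k+1}) \le E(\mathbf{x}_k) \left\{ 1 - \frac{(\mathbf{p}_k^T \mathbf{p}_k)^2}{(\mathbf{p}_k^T \mathbf{Q} \mathbf{p}_k)(\mathbf{p}_k^T \mathbf{Q}^{-1} \mathbf{p}_k)} \right\} \le (1 - \delta) E(\mathbf{x}_k),$$

since $\mathbf{B}^T \mathbf{p}_k = \mathbf{0}$. ∎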
The value $\delta$ associated with the above theorem is related to the eigenvalue
structure of $\mathbf{Q}$. If $\mathbf{p}$ were allowed to vary over the whole space, then the
Kantorovich inequality

$$\frac{(\mathbf{p}^T \mathbf{p})^2}{(\mathbf{p}^T \mathbf{Q} \mathbf{p})(\mathbf{p}^T \mathbf{Q}^{-1} \mathbf{p})} \ge \frac{4aA}{(a + A)^2}, \qquad (10.62)$$

where $a$ and $A$ are, respectively, the smallest and largest eigenvalues of $\mathbf{Q}$, gives
explicitly

$$\delta = \frac{4aA}{(a + A)^2}.$$

When $\mathbf{p}$ is restricted to the nullspace of $\mathbf{B}^T$, the corresponding value of $\delta$
is larger. In some special cases it is possible to obtain a fairly explicit estimate of $\delta$.
Suppose, for example, that the subspace $N$ were the subspace spanned by $m$ eigenvectors of
$\mathbf{Q}$. Then the subspace in which $\mathbf{p}$ is allowed to vary is the space orthogonal to
$N$ and is thus, in this case, the space generated by the other $n - m$ eigenvectors of
$\mathbf{Q}$. In this case since for $\mathbf{p}$ in $N^\perp$ (the space orthogonal to $N$), both
$\mathbf{Q}\mathbf{p}$ and $\mathbf{Q}^{-1}\mathbf{p}$ are also in $N^\perp$, the ratio $\delta$
satisfies

$$\delta = \frac{(\mathbf{p}^T \mathbf{p})^2}{(\mathbf{p}^T \mathbf{Q} \mathbf{p})(\mathbf{p}^T \mathbf{Q}^{-1} \mathbf{p})} \ge \frac{4aA}{(a + A)^2},$$

where now $a$ and $A$ are, respectively, the smallest and largest of the $n - m$ eigenvalues
of $\mathbf{Q}$ corresponding to $N^\perp$. Thus the convergence ratio (10.57) reduces to the
familiar form

$$E(\mathbf{x}_{k+1}) \le \left(\frac{A - a}{A + a}\right)^2 E(\mathbf{x}_k),$$

where $a$ and $A$ are these special eigenvalues. Thus, if $\mathbf{B}$, or equivalently $N$, is
chosen to include the eigenvectors corresponding to the most undesirable eigenvalues of
$\mathbf{Q}$, the convergence rate of the combined method will be quite attractive.
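
The effect of absorbing an undesirable eigenvalue into $N$ can be illustrated numerically. The sketch below assumes a diagonal $\mathbf{Q}$ with eigenvalues 1, 2, and 100, takes $N$ to be spanned by the eigenvector belonging to 100, and checks that each iteration satisfies $E(\mathbf{x}_{k+1}) \le ((A-a)/(A+a))^2 E(\mathbf{x}_k)$ with $a = 1$, $A = 2$. All data and names are assumptions chosen for illustration.

```python
# Sketch: convergence ratio of the combined method when N contains the
# eigenvector of the largest eigenvalue of an assumed diagonal Q.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def combined_step(Q, b, n_dir, x):
    """One iteration a)-d) of the theorem with N spanned by n_dir."""
    g = [qi - bi for qi, bi in zip(matvec(Q, x), b)]
    u = -dot(n_dir, g) / dot(n_dir, matvec(Q, n_dir))
    z = [xi + u * ni for xi, ni in zip(x, n_dir)]
    p = [bi - qi for bi, qi in zip(b, matvec(Q, z))]
    pp = dot(p, p)
    if pp == 0.0:
        return z
    alpha = pp / dot(p, matvec(Q, p))
    return [zi + alpha * pi for zi, pi in zip(z, p)]

Q = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 100.0]]
x_star = [1.0, 1.0, 1.0]
b = matvec(Q, x_star)

def energy(x):
    e = [xi - si for xi, si in zip(x, x_star)]
    return 0.5 * dot(e, matvec(Q, e))

combined_bound = ((2.0 - 1.0) / (2.0 + 1.0)) ** 2         # eigenvalues on N-perp
plain_bound = ((100.0 - 1.0) / (100.0 + 1.0)) ** 2        # steepest descent alone

x = [0.0, 0.0, 0.0]
bound_holds = True
for _ in range(8):
    e_old = energy(x)
    x = combined_step(Q, b, [0.0, 0.0, 1.0], x)
    bound_holds = bound_holds and energy(x) <= combined_bound * e_old + 1e-12

print(bound_holds, combined_bound, plain_bound)
```

The guaranteed per-step reduction factor drops from about 0.96 for plain steepest descent to 1/9 for the combined method in this assumed example.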


Applications 

The  combination  of  steepest  descent  and  Newton’s  method  can  be  applied  usefully 
in  a  number  of  important  situations.  Suppose,  for  example,  we  are  faced  with  a 
problem  of  the  form 

$$\text{minimize } f(\mathbf{x}, \mathbf{y}),$$

where $\mathbf{x} \in E^n$, $\mathbf{y} \in E^m$, and where the second partial derivatives with
respect to $\mathbf{x}$ are easily computable but those with respect to $\mathbf{y}$ are not. We
may then employ Newton steps with respect to $\mathbf{x}$ and steepest descent with respect to
$\mathbf{y}$.

Another instance where this idea can be greatly effective is when there are a few
vital variables in a problem which, being assigned high costs, tend to dominate the
value of the objective function; in other words, the partial second derivatives with
respect to these variables are large. The poor conditioning induced by these variables
can to some extent be reduced by proper scaling of variables, but more effectively,
by carrying out Newton’s method with respect to them and steepest descent with
respect to the others.


10.9  Summary 

The basic motivation behind quasi-Newton methods is to try to obtain, at least on
the average, the rapid convergence associated with Newton’s method without explicitly
evaluating the Hessian at every step. This can be accomplished by constructing
approximations to the inverse Hessian based on information gathered during the descent
process, and results in methods which, viewed in blocks of $n$ steps (where $n$ is
the dimension of the problem), generally possess superlinear convergence.

Good, or even superlinear, convergence measured in terms of large blocks, however,
is not always indicative of rapid convergence measured in terms of individual
steps. It is important, therefore, to design quasi-Newton methods so that their single
step convergence is rapid and relatively insensitive to line search inaccuracies. We
discussed two general principles for examining these aspects of descent algorithms.
The first of these is the modified Newton method in which the direction of descent
is taken as the result of multiplication of the negative gradient by a positive definite
matrix $\mathbf{S}$. The single step convergence ratio of this method is determined by
the usual steepest descent formula, but with the condition number of $\mathbf{S}\mathbf{F}$ rather
than just $\mathbf{F}$ used. This result was used to analyze some popular quasi-Newton methods,
to develop the self-scaling method having good single step convergence properties,
and to reexamine conjugate gradient methods.

The second principle is the combined method in which Newton’s method
is executed over a subspace where the Hessian is known and steepest descent is
executed elsewhere. This method converges at least as fast as steepest descent, and
by incorporating the information gathered as the method progresses, the Newton
portion can be executed over larger and larger subspaces.

At this point, it is perhaps valuable to summarize some of the main themes that
have been developed throughout the four chapters comprising Part II. These chapters
contain several important and popular algorithms that illustrate the range of
possibilities available for minimizing a general nonlinear function. From a broad
perspective, however, these individual algorithms can be considered simply as specific
patterns on the analytical fabric that is woven through the chapters, the fabric
that will support new algorithms and future developments.

One unifying element, which has proved its value several times, is the Global
Convergence Theorem. This result helped mold the final form of every algorithm
presented in Part II and has effectively resolved the major questions concerning
global convergence.

Another unifying element is the speed of convergence of an algorithm, which
we have defined in terms of the asymptotic properties of the sequences an algorithm
generates. Initially, it might have been argued that such measures, based on
properties of the tail of the sequence, are perhaps not truly indicative of the actual
time required to solve a problem. After all, a sequence generated in practice
is a truncated version of the potentially infinite sequence, and asymptotic properties
may not be representative of the finite version, so a more complex measure of the
speed of convergence may be required. It is fair to demand that the validity of the
asymptotic measures we have proposed be judged in terms of how well they predict
the performance of algorithms applied to specific examples. On this basis, as
illustrated by the numerical examples presented in these chapters, and on others, the
asymptotic rates are extremely reliable predictors of performance, provided that
one carefully tempers one’s analysis with common sense (by, for example, not concluding
that superlinear convergence is necessarily superior to linear convergence
when the superlinear convergence is based on repeated cycles of length $n$). A major
conclusion, therefore, of the previous chapters is the essential validity of the
asymptotic approach to convergence analysis. This conclusion is a major strand in
the analytical fabric of nonlinear programming.


10.10  Exercises 

1.  Prove  (10.4)  directly  for  the  modified  Newton  method  by  showing  that  each 
step  of  the  modified  Newton  method  is  simply  the  ordinary  method  of  steepest 
descent  applied  to  a  scaled  version  of  the  original  problem. 

2.  Find the rate of convergence of the version of Newton’s method defined by
(10.50), (10.51) of Chap. 8. Show that convergence is only linear if $\delta$ is larger
than the smallest eigenvalue of $\mathbf{F}(\mathbf{x}^*)$.

3.  Consider the problem of minimizing a quadratic function

$$f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T \mathbf{Q} \mathbf{x} - \mathbf{x}^T \mathbf{b},$$

where $\mathbf{Q}$ is symmetric and sparse (that is, there are relatively few nonzero
entries in $\mathbf{Q}$). The matrix $\mathbf{Q}$ has the form

$$\mathbf{Q} = \mathbf{I} + \mathbf{V},$$

where $\mathbf{I}$ is the identity and $\mathbf{V}$ is a matrix with eigenvalues bounded by
$\varepsilon < 1$ in magnitude.

(a) With the given information, what is the best bound you can give for the rate of
convergence of steepest descent applied to this problem?

(b) In general it is difficult to invert $\mathbf{Q}$ but the inverse can be approximated by
$\mathbf{I} - \mathbf{V}$, which is easy to calculate. (The approximation is very good for small
$\varepsilon$.) We are thus led to consider the iterative process

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k [\mathbf{I} - \mathbf{V}] \mathbf{g}_k,$$

where $\mathbf{g}_k = \mathbf{Q}\mathbf{x}_k - \mathbf{b}$ and $\alpha_k$ is chosen to minimize $f$
in the usual way. With the information given, what is the best bound on the rate of convergence
of this method?

(c) Show that for $\varepsilon < (\sqrt{5} - 1)/2$ the method in part (b) is always superior to
steepest descent.

4.  This  problem  shows  that  the  modified  Newton’s  method  is  globally  convergent 
under  very  weak  assumptions. 

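Let $a > 0$ and $b > a$ be given constants. Consider the collection $\mathcal{P}$ of all
$n \times n$ symmetric positive definite matrices $\mathbf{P}$ having all eigenvalues greater than
or equal to $a$ and all elements bounded in absolute value by $b$. Define the point-to-set
mapping $\mathbf{B} : E^n \to E^{n+n^2}$ by
$\mathbf{B}(\mathbf{x}) = \{(\mathbf{x}, \mathbf{P}) : \mathbf{P} \in \mathcal{P}\}$. Show that
$\mathbf{B}$ is a closed mapping.

Now given an objective function $f \in C^1$, consider the iterative algorithm

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \mathbf{P}_k \mathbf{g}_k,$$

where $\mathbf{g}_k = \mathbf{g}(\mathbf{x}_k)$ is the gradient of $f$ at $\mathbf{x}_k$,
$\mathbf{P}_k$ is any matrix from $\mathcal{P}$ and $\alpha_k$ is chosen to minimize
$f(\mathbf{x}_{k+1})$. This algorithm can be represented by $\mathbf{A}$ which
can be decomposed as $\mathbf{A} = \mathbf{S}\mathbf{C}\mathbf{B}$ where $\mathbf{B}$ is defined
above, $\mathbf{C}$ is defined by
$\mathbf{C}(\mathbf{x}, \mathbf{P}) = (\mathbf{x}, -\mathbf{P}\mathbf{g}(\mathbf{x}))$, and
$\mathbf{S}$ is the standard line search mapping. Show that if
restricted to a compact set in $E^n$, the mapping $\mathbf{A}$ is closed.

Assuming that a sequence $\{\mathbf{x}_k\}$ generated by this algorithm is bounded, show
that the limit $\mathbf{x}^*$ of any convergent subsequence satisfies
$\mathbf{g}(\mathbf{x}^*) = \mathbf{0}$.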

5.  The following algorithm has been proposed for minimizing unconstrained functions
$f(\mathbf{x})$, $\mathbf{x} \in E^n$, without using gradients: Starting with some arbitrary point
$\mathbf{x}_0$, obtain a direction of search $\mathbf{d}_k$ such that for each component of
$\mathbf{d}_k$

$$f(\mathbf{x}_k + (\mathbf{d}_k)_i \mathbf{e}_i) = \min_{d_i} f(\mathbf{x}_k + d_i \mathbf{e}_i),$$

where $\mathbf{e}_i$ denotes the $i$th column of the identity matrix. In other words, the $i$th
component of $\mathbf{d}_k$ is determined through a line search minimizing $f(\mathbf{x})$ along
the $i$th component.

The next point $\mathbf{x}_{k+1}$ is then determined in the usual way through a line search
along $\mathbf{d}_k$; that is,

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{d}_k,$$

where $\alpha_k$ minimizes $f(\mathbf{x}_{k+1})$.

(a) Obtain an explicit representation for the algorithm for the quadratic case where

$$f(\mathbf{x}) = \tfrac{1}{2}(\mathbf{x} - \mathbf{x}^*)^T \mathbf{Q} (\mathbf{x} - \mathbf{x}^*) + f(\mathbf{x}^*).$$

(b) What condition on $f(\mathbf{x})$ or its derivatives will guarantee descent of this
algorithm for general $f(\mathbf{x})$?

(c) Derive the convergence rate of this algorithm (assuming a quadratic objective).
Express your answer in terms of the condition number of some matrix.
6.  Suppose that the rank one correction method of Sect. 10.2 is applied to the
quadratic problem (10.2) and suppose that the matrix
$\mathbf{R}_0 = \mathbf{F}^{1/2} \mathbf{H}_0 \mathbf{F}^{1/2}$ has
$m < n$ eigenvalues less than unity and $n - m$ eigenvalues greater than unity.
Show that the condition $\mathbf{q}_k^T (\mathbf{p}_k - \mathbf{H}_k \mathbf{q}_k) > 0$ will be
satisfied at most $m$ times during the course of the method and hence, if updating is performed
only when this condition holds, the sequence $\{\mathbf{H}_k\}$ will not converge to
$\mathbf{F}^{-1}$. Infer from this that, in using the rank one correction method, $\mathbf{H}_0$
should be taken very small; but that, despite such a precaution, on nonquadratic problems the
method is subject to difficulty.

7.  Show that if $\mathbf{H}_0 = \mathbf{I}$ the Davidon-Fletcher-Powell method is the conjugate
gradient method. What similar statement can be made when $\mathbf{H}_0$ is an arbitrary
symmetric positive definite matrix?

8.  In the text it is shown that for the Davidon-Fletcher-Powell method $\mathbf{H}_{k+1}$ is
positive definite if $\mathbf{H}_k$ is. The proof assumed that $\alpha_k$ is chosen to exactly
minimize $f(\mathbf{x}_k + \alpha \mathbf{d}_k)$. Show that any $\alpha_k > 0$ which leads to
$\mathbf{p}_k^T \mathbf{q}_k > 0$ will guarantee the positive definiteness of $\mathbf{H}_{k+1}$.
Show that for a quadratic problem any $\alpha_k \ne 0$ leads to a positive definite
$\mathbf{H}_{k+1}$.

9.  Suppose along the line $\mathbf{x}_k + \alpha \mathbf{d}_k$, $\alpha > 0$, the function
$f(\mathbf{x}_k + \alpha \mathbf{d}_k)$ is unimodal and differentiable. Let $\bar{\alpha}_k$ be the
minimizing value of $\alpha$. Show that if any $\alpha_k > \bar{\alpha}_k$ is selected to define
$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{d}_k$, then $\mathbf{p}_k^T \mathbf{q}_k > 0$.
(Refer to Sect. 10.3.)

10.  Let $\{\mathbf{H}_k\}$, $k = 0, 1, 2, \ldots$ be the sequence of matrices generated by the
Davidon-Fletcher-Powell method applied, without restarting, to a function $f$ having continuous
second partial derivatives. Assuming that there is $a > 0$, $A > 0$ such that for all $k$ we have
$\mathbf{H}_k - a\mathbf{I}$ and $A\mathbf{I} - \mathbf{H}_k$ positive definite and the
corresponding sequence of $\mathbf{x}_k$’s is bounded, show that the method is globally convergent.

11.  Verify  Eq.  (10.41). 

12. 

(a) Show that starting with the rank one update formula for $\mathbf{H}$, forming the
complementary formula, and then taking the inverse restores the original formula.

(b) What value of $\phi$ in the Broyden class corresponds to the rank one formula?

13.  Explain how the partial Davidon method can be implemented for $m < n/2$, with
less storage than required by the full method.

14.  Prove  statements  (10.1)  and  (10.2)  below  Eq.  (10.46)  in  Sect.  10.6. 

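15.  Consider using

$$\gamma_k = \frac{\mathbf{p}_k^T \mathbf{H}_k^{-1} \mathbf{p}_k}{\mathbf{p}_k^T \mathbf{q}_k}$$

instead of (10.47).

(a) Show that this also serves as a suitable scale factor for a self-scaling quasi-Newton
method.

(b) Extend part (a) to

$$\gamma_k = (1 - \phi)\, \frac{\mathbf{p}_k^T \mathbf{q}_k}{\mathbf{q}_k^T \mathbf{H}_k \mathbf{q}_k} + \phi\, \frac{\mathbf{p}_k^T \mathbf{H}_k^{-1} \mathbf{p}_k}{\mathbf{p}_k^T \mathbf{q}_k}$$

for $0 \le \phi \le 1$.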
16.  Prove global convergence of the combination of steepest descent and Newton’s
method.

17.  Formulate a rate of convergence theorem for the application of the combination
of steepest descent and Newton’s method to nonquadratic problems.

18.  Prove that if $\mathbf{Q}$ is positive definite

$$\frac{\mathbf{p}^T \mathbf{p}}{\mathbf{p}^T \mathbf{Q} \mathbf{p}} \le \frac{\mathbf{p}^T \mathbf{Q}^{-1} \mathbf{p}}{\mathbf{p}^T \mathbf{p}}$$

for any vector $\mathbf{p}$.

19.  It is possible to combine Newton’s method and the partial conjugate gradient
method. Given a subspace $N \subset E^n$, $\mathbf{x}_{k+1}$ is generated from $\mathbf{x}_k$ by
first finding $\mathbf{z}_k$ by taking a Newton step in the linear variety through $\mathbf{x}_k$
parallel to $N$, and then taking $m$ conjugate gradient steps from $\mathbf{z}_k$. What is a bound
on the rate of convergence of this method?

20.  In this exercise we explore how the combined method of Sect. 10.8 can be
updated as more information becomes available. Begin with $N_0 = \{\mathbf{0}\}$. If $N_k$ is
represented by the corresponding matrix $\mathbf{B}_k$, define $N_{k+1}$ by the corresponding
$\mathbf{B}_{k+1} = [\mathbf{B}_k, \mathbf{p}_k]$, where $\mathbf{p}_k = \mathbf{x}_{k+1} - \mathbf{z}_k$.

(a) If $\mathbf{D}_k = \mathbf{B}_k (\mathbf{B}_k^T \mathbf{F} \mathbf{B}_k)^{-1} \mathbf{B}_k^T$ is known, show that

$$\mathbf{D}_{k+1} = \mathbf{D}_k + \frac{(\mathbf{p}_k - \mathbf{D}_k \mathbf{q}_k)(\mathbf{p}_k - \mathbf{D}_k \mathbf{q}_k)^T}{(\mathbf{p}_k - \mathbf{D}_k \mathbf{q}_k)^T \mathbf{q}_k},$$

where $\mathbf{q}_k = \mathbf{g}_{k+1} - \mathbf{g}_k$. (This is the rank one correction of Sect. 10.2.)

(b) Develop an algorithm that uses (a) in conjunction with the combined method of
Sect. 10.8 and discuss its convergence properties.
ysis of the one-by-one shift of the eigenvalues to unity is contained
in Fletcher [F6]. The scaling concept, including the self-scaling algorithm,
is due to Oren and Luenberger [O5]. Also see Oren [O4]. The
two-parameter class of updates defined by the scaling procedure can be
shown to be equivalent to the symmetric Huang class. Oren and Spedicato
[O6] developed a procedure for selecting the scaling parameter so
as to optimize the condition number of the update.

10.7  The idea of expressing conjugate gradient methods as update formulae is
due to Perry [P3]. The development of the form presented here is due to
Shanno [S4]. Preconditioning for conjugate gradient methods was suggested
by Bertsekas [B9].

10.8  The  combined  method  appears  in  Luenberger  [L10]. 
Chapter 11

Constrained  Minimization  Conditions 


We  turn  now,  in  this  final  part  of  the  book,  to  the  study  of  minimization  problems 
having  constraints.  We  begin  by  studying  in  this  chapter  the  necessary  and  sufficient 
conditions  satisfied  at  solution  points.  These  conditions,  aside  from  their  intrinsic 
value  in  characterizing  solutions,  define  Lagrange  multipliers  and  a  certain  Hessian 
matrix  which,  taken  together,  form  the  foundation  for  both  the  development  and 
analysis  of  algorithms  presented  in  subsequent  chapters. 

The general method used in this chapter to derive necessary and sufficient conditions
is a straightforward extension of that used in Chap. 7 for unconstrained problems.
In the case of equality constraints, the feasible region is a curved surface
embedded in $E^n$. Differential conditions satisfied at an optimal point are derived by
considering the value of the objective function along curves on this surface passing
through the optimal point. Thus the arguments run almost identically to those for the
unconstrained case; families of curves on the constraint surface replacing the earlier
artifice of considering feasible directions. There is also a theory of zero-order
conditions that is presented in the final section of the chapter.


11.1  Constraints 


We deal with general nonlinear programming problems of the form

$$\begin{array}{ll}
\text{minimize} & f(\mathbf{x}) \\
\text{subject to} & h_1(\mathbf{x}) = 0, \quad g_1(\mathbf{x}) \le 0 \\
& h_2(\mathbf{x}) = 0, \quad g_2(\mathbf{x}) \le 0 \\
& \qquad \vdots \qquad\qquad\quad \vdots \\
& h_m(\mathbf{x}) = 0, \quad g_p(\mathbf{x}) \le 0 \\
& \mathbf{x} \in \Omega \subset E^n,
\end{array} \qquad (11.1)$$

where $m \le n$ and the functions $f$, $h_i$, $i = 1, 2, \ldots, m$ and $g_j$, $j = 1, 2, \ldots, p$
are continuous, and usually assumed to possess continuous second partial derivatives. For
notational simplicity, we introduce the vector-valued functions
$\mathbf{h} = (h_1, h_2, \ldots, h_m)$ and $\mathbf{g} = (g_1, g_2, \ldots, g_p)$ and rewrite
(11.1) as

$$\begin{array}{ll}
\text{minimize} & f(\mathbf{x}) \\
\text{subject to} & \mathbf{h}(\mathbf{x}) = \mathbf{0}, \quad \mathbf{g}(\mathbf{x}) \le \mathbf{0} \\
& \mathbf{x} \in \Omega.
\end{array} \qquad (11.2)$$

The constraints $\mathbf{h}(\mathbf{x}) = \mathbf{0}$, $\mathbf{g}(\mathbf{x}) \le \mathbf{0}$ are
referred to as functional constraints, while the constraint $\mathbf{x} \in \Omega$ is a set
constraint. As before we continue to de-emphasize the set constraint, assuming in most cases that
either $\Omega$ is the whole space $E^n$ or that the solution to (11.2) is in the interior of
$\Omega$. A point $\mathbf{x} \in \Omega$ that satisfies all the functional constraints is said
to be feasible.

A fundamental concept that provides a great deal of insight as well as simplifying
the required theoretical development is that of an active constraint. An inequality
constraint $g_i(\mathbf{x}) \le 0$ is said to be active at a feasible point $\mathbf{x}$ if
$g_i(\mathbf{x}) = 0$ and inactive at $\mathbf{x}$ if $g_i(\mathbf{x}) < 0$. By convention we refer
to any equality constraint $h_i(\mathbf{x}) = 0$ as active at any feasible point. The constraints
active at a feasible point $\mathbf{x}$ restrict the domain of feasibility in neighborhoods of
$\mathbf{x}$, while the other, inactive constraints, have no influence in neighborhoods of
$\mathbf{x}$. Therefore, in studying the properties of a local minimum point, it is clear that
attention can be restricted to the active constraints. This is illustrated in Fig. 11.1 where
local properties satisfied by the solution $\mathbf{x}^*$ obviously do not depend on the inactive
constraints $g_2$ and $g_3$.

It is clear that, if it were known a priori which constraints were active at the solution to (11.1), the solution would be a local minimum point of the problem defined by ignoring the inactive constraints and treating all active constraints as equality constraints. Hence, with respect to local (or relative) solutions, the problem could be regarded as having equality constraints only. This observation suggests that the majority of insight and theory applicable to (11.1) can be derived by consideration of equality constraints alone, later making additions to account for the selection of the active constraints. This is indeed so. Therefore, in the early portion of this chapter we consider problems having only equality constraints, thereby both economizing on notation and isolating the primary ideas associated with constrained problems. We then extend these results to the more general situation.

Fig. 11.1 Example of inactive constraints


11.2  Tangent  Plane 

A set of equality constraints on $E^n$

$$
\begin{array}{c}
h_1(x) = 0 \\
h_2(x) = 0 \\
\vdots \\
h_m(x) = 0
\end{array}
\tag{11.3}
$$

defines a subset of $E^n$ which is best viewed as a hypersurface. If the constraints are everywhere regular, in a sense to be described below, this hypersurface is of dimension $n - m$. If, as we assume in this section, the functions $h_i$, $i = 1, 2, \ldots, m$, belong to $C^1$, the surface defined by them is said to be smooth.

Associated with a point on a smooth surface is the tangent plane at that point, a term which in two or three dimensions has an obvious meaning. To formalize the general notion, we begin by defining curves on a surface. A curve on a surface $S$ is a family of points $x(t) \in S$ continuously parameterized by $t$ for $a \le t \le b$. The curve is differentiable if $\dot{x}(t) = (d/dt)x(t)$ exists, and is twice differentiable if $\ddot{x}(t)$ exists. A curve $x(t)$ is said to pass through the point $x^*$ if $x^* = x(t^*)$ for some $t^*$, $a \le t^* \le b$. The derivative of the curve at $x^*$ is, of course, defined as $\dot{x}(t^*)$. It is itself a vector in $E^n$.

Now consider all differentiable curves on $S$ passing through a point $x^*$. The tangent plane at $x^*$ is defined as the collection of the derivatives at $x^*$ of all these differentiable curves. The tangent plane is a subspace of $E^n$.

For surfaces defined through a set of constraint relations such as (11.3), the problem of obtaining an explicit representation for the tangent plane is a fundamental problem that we now address. Ideally, we would like to express this tangent plane in terms of derivatives of the functions $h_i$ that define the surface. We introduce the subspace

$$M = \{ y : \nabla h(x^*) y = 0 \}$$

and investigate under what conditions $M$ is equal to the tangent plane at $x^*$. The key concept for this purpose is that of a regular point. Figure 11.2 shows some examples where for visual clarity the tangent planes (which are subspaces) are translated to the point $x^*$.

Fig. 11.2 Three examples of tangent planes (translated to $x^*$)

Definition. A point $x^*$ satisfying the constraint $h(x^*) = 0$ is said to be a regular point of the constraint if the gradient vectors $\nabla h_1(x^*), \nabla h_2(x^*), \ldots, \nabla h_m(x^*)$ are linearly independent.

Note that if $h$ is affine, $h(x) = Ax + b$, regularity is equivalent to $A$ having rank equal to $m$, and this condition is independent of $x$.

In general, at regular points it is possible to characterize the tangent plane in terms of the gradients of the constraint functions.

Theorem. At a regular point $x^*$ of the surface $S$ defined by $h(x) = 0$ the tangent plane is equal to

$$M = \{ y : \nabla h(x^*) y = 0 \}.$$

Proof. Let $T$ be the tangent plane at $x^*$. It is clear that $T \subset M$ whether $x^*$ is regular or not, for any curve $x(t)$ passing through $x^*$ at $t = t^*$ having derivative $\dot{x}(t^*)$ such that $\nabla h(x^*) \dot{x}(t^*) \ne 0$ would not lie on $S$.

To prove that $M \subset T$ we must show that if $y \in M$ then there is a curve on $S$ passing through $x^*$ with derivative $y$. To construct such a curve we consider the equations

$$h\bigl(x^* + t y + \nabla h(x^*)^T u(t)\bigr) = 0, \tag{11.4}$$

where for fixed $t$ we consider $u(t) \in E^m$ to be the unknown. This is a nonlinear system of $m$ equations and $m$ unknowns, parameterized continuously by $t$. At $t = 0$ there is a solution $u(0) = 0$. The Jacobian matrix of the system with respect to $u$ at $t = 0$ is the $m \times m$ matrix

$$\nabla h(x^*) \nabla h(x^*)^T,$$

which is nonsingular, since $\nabla h(x^*)$ is of full rank if $x^*$ is a regular point. Thus, by the Implicit Function Theorem (see Appendix A) there is a continuously differentiable solution $u(t)$ in some region $-a \le t \le a$.

The curve $x(t) = x^* + t y + \nabla h(x^*)^T u(t)$ is thus, by construction, a curve on $S$. By differentiating the system (11.4) with respect to $t$ at $t = 0$ we obtain

$$0 = \left. \frac{d}{dt} h(x(t)) \right|_{t=0} = \nabla h(x^*) y + \nabla h(x^*) \nabla h(x^*)^T \dot{u}(0).$$

By definition of $y$ we have $\nabla h(x^*) y = 0$ and thus, again since $\nabla h(x^*) \nabla h(x^*)^T$ is nonsingular, we conclude that $\dot{u}(0) = 0$. Therefore

$$\dot{x}(0) = y + \nabla h(x^*)^T \dot{u}(0) = y,$$

and the constructed curve has derivative $y$ at $x^*$. ∎

It  is  important  to  recognize  that  the  condition  of  being  a  regular  point  is  not  a 
condition  on  the  constraint  surface  itself  but  on  its  representation  in  terms  of  an  h. 
The  tangent  plane  is  defined  independently  of  the  representation,  while  M  is  not. 

Example. In $E^2$ let $h(x_1, x_2) = x_1$. Then $h(x) = 0$ yields the $x_2$ axis, and every point on that axis is regular. If instead we put $h(x_1, x_2) = x_1^2$, again $S$ is the $x_2$ axis but now no point on the axis is regular. Indeed in this case $M = E^2$, while the tangent plane is the $x_2$ axis.
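The dependence of regularity on the representation can be checked numerically. The following is a minimal sketch in Python, not part of the original development: the helper names and the test point are illustrative assumptions. It approximates the gradient of a single constraint by central differences and tests whether it is nonzero, which is the linear independence condition when $m = 1$.

```python
def numerical_gradient(h, x, eps=1e-6):
    """Central-difference approximation of the gradient of h at x (hypothetical helper)."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += eps
        backward[i] -= eps
        grad.append((h(forward) - h(backward)) / (2 * eps))
    return grad


def is_regular_single_constraint(h, x, tol=1e-8):
    """With one constraint (m = 1), regularity means the gradient is nonzero."""
    return any(abs(g) > tol for g in numerical_gradient(h, x))


def h_linear(x):
    return x[0]            # h(x1, x2) = x1


def h_squared(x):
    return x[0] ** 2       # h(x1, x2) = x1^2, same surface S


point_on_axis = [0.0, 3.0]  # an arbitrary point on the x2 axis
regular_linear = is_regular_single_constraint(h_linear, point_on_axis)
regular_squared = is_regular_single_constraint(h_squared, point_on_axis)
print(regular_linear, regular_squared)
```

Both functions describe the same set $S$, yet only the first representation makes the point regular, in agreement with the remark that regularity is a property of $h$ and not of the surface.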
11.3  First-Order  Necessary  Conditions  (Equality  Constraints)


The derivation of necessary and sufficient conditions for a point to be a local minimum point subject to equality constraints is fairly simple now that the representation of the tangent plane is known. We begin by deriving the first-order necessary conditions.


Lemma. Let $x^*$ be a regular point of the constraints $h(x) = 0$ and a local extremum point (a minimum or maximum) of $f$ subject to these constraints. Then all $y \in E^n$ satisfying

$$\nabla h(x^*) y = 0 \tag{11.5}$$

must also satisfy

$$\nabla f(x^*) y = 0. \tag{11.6}$$

Proof. Let $y$ be any vector in the tangent plane at $x^*$ and let $x(t)$ be any smooth curve on the constraint surface passing through $x^*$ with derivative $y$ at $x^*$; that is, $x(0) = x^*$, $\dot{x}(0) = y$, and $h(x(t)) = 0$ for $-a \le t \le a$ for some $a > 0$.

Since $x^*$ is a regular point, the tangent plane is identical with the set of $y$'s satisfying $\nabla h(x^*) y = 0$. Then, since $x^*$ is a constrained local extremum point of $f$, we have

$$\left. \frac{d}{dt} f(x(t)) \right|_{t=0} = 0,$$

or equivalently,

$$\nabla f(x^*) y = 0. \; ∎$$


The above Lemma says that $\nabla f(x^*)$ is orthogonal to the tangent plane. Next we conclude that this implies that $\nabla f(x^*)$ is a linear combination of the gradients of $h$ at $x^*$, a relation that leads to the introduction of Lagrange multipliers. As in much of nonlinear programming, the Lagrange multiplier vector is labeled $\lambda$ here, rather than $y$ as in linear programming.

Theorem. Let $x^*$ be a local extremum point of $f$ subject to the constraints $h(x) = 0$. Assume further that $x^*$ is a regular point of these constraints. Then there is a $\lambda \in E^m$ such that

$$\nabla f(x^*) + \lambda^T \nabla h(x^*) = 0. \tag{11.7}$$

Proof. From the Lemma we may conclude that the value of the linear program

$$
\begin{array}{ll}
\text{maximize} & \nabla f(x^*) y \\
\text{subject to} & \nabla h(x^*) y = 0
\end{array}
$$

is zero. Thus, by the Duality Theorem of linear programming (Sect. 4.2) the dual problem is feasible. Specifically, there is $\lambda \in E^m$ such that $\nabla f(x^*) + \lambda^T \nabla h(x^*) = 0$. ∎

It should be noted that the first-order necessary conditions

$$\nabla f(x^*) + \lambda^T \nabla h(x^*) = 0$$

together with the constraints

$$h(x^*) = 0$$

give a total of $n + m$ (generally nonlinear) equations in the $n + m$ variables comprising $x^*$, $\lambda$. Thus the necessary conditions are a complete set since, at least locally, they determine a unique solution.

It is convenient to introduce the Lagrangian associated with the constrained problem, defined as

$$l(x, \lambda) = f(x) + \lambda^T h(x). \tag{11.8}$$

The necessary conditions can then be expressed in the form

$$\nabla_x l(x, \lambda) = 0 \tag{11.9}$$

$$\nabla_\lambda l(x, \lambda) = 0, \tag{11.10}$$

the second of these being simply a restatement of the constraints.


11.4  Examples 

We  digress  briefly  from  our  mathematical  development  to  consider  some  examples 
of  constrained  optimization  problems.  We  present  five  simple  examples  that  can 
be  treated  explicitly  in  a  short  space  and  then  briefly  discuss  a  broader  range  of 
applications. 

Example 1. Consider the problem

$$
\begin{array}{ll}
\text{minimize} & x_1 x_2 + x_2 x_3 + x_1 x_3 \\
\text{subject to} & x_1 + x_2 + x_3 = 3.
\end{array}
$$

The necessary conditions become

$$
\begin{array}{l}
x_2 + x_3 + \lambda = 0 \\
x_1 + x_3 + \lambda = 0 \\
x_1 + x_2 + \lambda = 0.
\end{array}
$$

These three equations together with the one constraint equation give four equations that can be solved for the four unknowns $x_1, x_2, x_3, \lambda$. Solution yields $x_1 = x_2 = x_3 = 1$, $\lambda = -2$.
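Because the necessary conditions of Example 1 are linear, they can be solved mechanically. The sketch below is an illustration only: the function `solve_linear_system` is a hypothetical helper written for this purpose (plain Gaussian elimination with partial pivoting), and the ordering of the unknowns as $(x_1, x_2, x_3, \lambda)$ is an assumption of the sketch.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    # Augmented matrix, so that row operations act on both sides at once.
    aug = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


# Unknowns ordered as (x1, x2, x3, lambda).
coefficients = [
    [0, 1, 1, 1],   # x2 + x3 + lambda = 0
    [1, 0, 1, 1],   # x1 + x3 + lambda = 0
    [1, 1, 0, 1],   # x1 + x2 + lambda = 0
    [1, 1, 1, 0],   # x1 + x2 + x3 = 3
]
right_hand_side = [0, 0, 0, 3]

x1, x2, x3, lam = solve_linear_system(coefficients, right_hand_side)
print(x1, x2, x3, lam)
```

The computed values agree with $x_1 = x_2 = x_3 = 1$, $\lambda = -2$ up to rounding.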

Example 2 (Maximum Volume). Let us consider an example of the type that is now standard in textbooks and which has a structure similar to that of the example above. We seek to construct a cardboard box of maximum volume, given a fixed area of cardboard.

Denoting the dimensions of the box by $x, y, z$, the problem can be expressed as

$$
\begin{array}{ll}
\text{maximize} & xyz \\
\text{subject to} & (xy + yz + xz) = \dfrac{c}{2},
\end{array}
\tag{11.11}
$$

where $c > 0$ is the given area of cardboard. Introducing a Lagrange multiplier, the first-order necessary conditions are easily found to be

$$
\begin{array}{l}
yz + \lambda (y + z) = 0 \\
xz + \lambda (x + z) = 0 \\
xy + \lambda (x + y) = 0
\end{array}
\tag{11.12}
$$

together with the constraint. Before solving these, let us note that the sum of these equations is $(xy + yz + xz) + 2\lambda (x + y + z) = 0$. Using the constraint this becomes $c/2 + 2\lambda (x + y + z) = 0$. From this it is clear that $\lambda \ne 0$. Now we can show that $x$, $y$, and $z$ are nonzero. This follows because $x = 0$ implies $z = 0$ from the second equation and $y = 0$ from the third equation. In a similar way, it is seen that if any one of $x$, $y$, or $z$ is zero, all must be zero, which is impossible since the constraint would then be violated.

To solve the equations, multiply the first by $x$ and the second by $y$, and then subtract the two to obtain

$$\lambda (x - y) z = 0.$$

Operate similarly on the second and third to obtain

$$\lambda (y - z) x = 0.$$

Since no variables can be zero, it follows that $x = y = z = \sqrt{c/6}$ is the unique solution to the necessary conditions. The box must be a cube.
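The cube solution can be confirmed by substituting it into the necessary conditions. The following sketch assumes an illustrative area $c = 6$ and uses the multiplier value $\lambda = -x/2$, which follows from $yz + \lambda(y + z) = 0$ when $x = y = z$; the function name is hypothetical.

```python
import math


def box_condition_residuals(x, y, z, lam, c):
    """Residuals of the three stationarity equations (11.12) and the area constraint (11.11)."""
    return [
        y * z + lam * (y + z),
        x * z + lam * (x + z),
        x * y + lam * (x + y),
        (x * y + y * z + x * z) - c / 2,
    ]


c = 6.0                    # illustrative cardboard area (an assumption)
side = math.sqrt(c / 6)    # x = y = z = sqrt(c / 6)
lam = -side / 2            # from side^2 + lam * 2 * side = 0
residuals = box_condition_residuals(side, side, side, lam, c)
volume = side ** 3
print(residuals, volume)
```

All four residuals vanish, so the cube satisfies the first-order conditions; the sketch does not by itself show that the point is a maximum.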

Example 3 (Entropy). Optimization problems often describe natural phenomena. An example is the characterization of naturally occurring probability distributions as maximum entropy distributions.

As a specific example consider a discrete probability density corresponding to a measured value taking one of $n$ values $x_1, x_2, \ldots, x_n$. The probability associated with $x_i$ is $p_i$. The $p_i$'s satisfy $p_i \ge 0$ and $\sum_{i=1}^{n} p_i = 1$.

The entropy of such a density is

$$\varepsilon = -\sum_{i=1}^{n} p_i \log(p_i).$$

The mean value of the density is $\sum_{i=1}^{n} x_i p_i$.

If the value of the mean is known to be $m$ (by the physical situation), the maximum entropy argument suggests that the density should be taken as that which solves the following problem:

$$
\begin{array}{ll}
\text{maximize} & -\displaystyle\sum_{i=1}^{n} p_i \log(p_i) \\
\text{subject to} & \displaystyle\sum_{i=1}^{n} p_i = 1 \\
 & \displaystyle\sum_{i=1}^{n} x_i p_i = m \\
 & p_i \ge 0, \quad i = 1, 2, \ldots, n.
\end{array}
\tag{11.13}
$$

We begin by ignoring the nonnegativity constraints, believing that they may be inactive. Introducing two Lagrange multipliers, $\lambda$ and $\mu$, the Lagrangian is

$$l = \sum_{i=1}^{n} \{ -p_i \log p_i + \lambda p_i + \mu x_i p_i \} - \lambda - \mu m.$$

The necessary conditions are immediately found to be

$$-\log p_i - 1 + \lambda + \mu x_i = 0, \quad i = 1, 2, \ldots, n.$$

This leads to

$$p_i = \exp\{ (\lambda - 1) + \mu x_i \}, \quad i = 1, 2, \ldots, n. \tag{11.14}$$

We note that $p_i > 0$, so the nonnegativity constraints are indeed inactive. The result (11.14) is known as an exponential density. The Lagrange multipliers $\lambda$ and $\mu$ are parameters that must be selected so that the two equality constraints are satisfied.

Example 4 (Hanging Chain). A chain is suspended from two thin hooks that are 16 ft apart on a horizontal line as shown in Fig. 11.3. The chain itself consists of 20 links of stiff steel. Each link is one foot in length (measured inside). We wish to formulate the problem to determine the equilibrium shape of the chain.

Fig. 11.3 A hanging chain

The solution can be found by minimizing the potential energy of the chain. Let us number the links consecutively from 1 to 20 starting with the left end. We let link $i$ span an $x$ distance of $x_i$ and a $y$ distance of $y_i$. Then $x_i^2 + y_i^2 = 1$. The potential energy of a link is its weight times its vertical height (from some reference). The potential energy of the chain is the sum of the potential energies of each link. We may take the top of the chain as reference and assume that the mass of each link is concentrated at its center. Assuming unit weight, the potential energy is then

$$\tfrac{1}{2} y_1 + \left( y_1 + \tfrac{1}{2} y_2 \right) + \left( y_1 + y_2 + \tfrac{1}{2} y_3 \right) + \cdots + \left( y_1 + y_2 + \cdots + y_{n-1} + \tfrac{1}{2} y_n \right) = \sum_{i=1}^{n} \left( n - i + \tfrac{1}{2} \right) y_i,$$

where $n = 20$ in our example.

The chain is subject to two constraints: The total $y$ displacement is zero, and the total $x$ displacement is 16. Thus the equilibrium shape is the solution of

$$
\begin{array}{ll}
\text{minimize} & \displaystyle\sum_{i=1}^{n} \left( n - i + \tfrac{1}{2} \right) y_i \\
\text{subject to} & \displaystyle\sum_{i=1}^{n} y_i = 0 \\
 & \displaystyle\sum_{i=1}^{n} \sqrt{1 - y_i^2} = 16.
\end{array}
\tag{11.15}
$$

The first-order necessary conditions are

$$\left( n - i + \tfrac{1}{2} \right) + \lambda - \frac{\mu y_i}{\sqrt{1 - y_i^2}} = 0 \tag{11.16}$$

for $i = 1, 2, \ldots, n$. This leads directly to

$$y_i = -\frac{n - i + \tfrac{1}{2} + \lambda}{\sqrt{\mu^2 + \left( n - i + \tfrac{1}{2} + \lambda \right)^2}}, \tag{11.17}$$

where the sign of the square root corresponds to $\mu < 0$, so that the links at the left end slope downward.

As in Example 3 the solution is determined once the Lagrange multipliers are known. They must be selected so that the solution satisfies the two constraints.

It is useful to point out that problems of this type may have local minimum points. The reader can examine this by considering a short chain of, say, four links and v and w configurations.
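The selection of the multipliers for the hanging chain can be sketched numerically. The sketch below assumes the closed form $y_i = -a_i / \sqrt{\mu^2 + a_i^2}$ with $a_i = n - i + \tfrac{1}{2} + \lambda$, which follows from the first-order conditions when $\mu < 0$. It also uses the observation that $\lambda = -n/2$ makes the $a_i$ antisymmetric about the middle of the chain, so the vertical displacements sum to zero automatically; $|\mu|$ is then found by bisection on the horizontal span. The function name, bracket and iteration count are illustrative assumptions.

```python
import math


def chain_shape(n=20, span=16.0, iterations=200):
    """Link displacements of the hanging chain, with lambda = -n/2 fixed by symmetry."""
    lam = -n / 2.0
    a = [n - i + 0.5 + lam for i in range(1, n + 1)]   # a_i = -a_(n+1-i)

    def total_x(mu_abs):
        # x_i = sqrt(1 - y_i^2) = |mu| / sqrt(mu^2 + a_i^2)
        return sum(mu_abs / math.sqrt(mu_abs ** 2 + ai ** 2) for ai in a)

    lo, hi = 1e-9, 1e6      # total_x increases from nearly 0 to nearly n on this bracket
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if total_x(mid) < span:
            lo = mid
        else:
            hi = mid
    mu_abs = 0.5 * (lo + hi)
    y = [-ai / math.sqrt(mu_abs ** 2 + ai ** 2) for ai in a]
    x = [math.sqrt(1.0 - yi ** 2) for yi in y]
    return lam, -mu_abs, x, y   # the multiplier mu itself is negative


lam, mu, x, y = chain_shape()
print(lam, mu, sum(x), sum(y))
```

The first half of the links has negative $y_i$ (descending) and the second half positive $y_i$ (ascending), giving the symmetric equilibrium shape; the sketch does not explore the other local minima mentioned in the text.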

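Example 5 (Portfolio Design). Suppose there are $n$ securities indexed by $i = 1, 2, \ldots, n$. Each security $i$ is characterized by its random rate of return $r_i$ which has mean value $\bar{r}_i$. Its covariances with the rates of return of other securities are $\sigma_{ij}$, for $j = 1, 2, \ldots, n$. The portfolio problem is to allocate total available wealth among these $n$ securities, allocating a fraction $w_i$ of wealth to the security $i$.

The overall rate of return of a portfolio is $r = \sum_{i=1}^{n} w_i r_i$, which has mean $\bar{r} = \sum_{i=1}^{n} w_i \bar{r}_i$ and variance $\sigma^2 = \sum_{i,j=1}^{n} w_i \sigma_{ij} w_j$.

Markowitz introduced the concept of devising efficient portfolios which for a given expected rate of return $\bar{r}$ have minimum possible variance. Such a portfolio is the solution to the problem

$$
\begin{array}{ll}
\text{minimize} & \displaystyle\sum_{i,j=1}^{n} w_i \sigma_{ij} w_j \\
\text{subject to} & \displaystyle\sum_{i=1}^{n} w_i \bar{r}_i = \bar{r} \\
 & \displaystyle\sum_{i=1}^{n} w_i = 1.
\end{array}
$$

The second constraint forces the sum of the weights to equal one. There may be the further restriction that each $w_i \ge 0$ which would imply that the securities must not be shorted (that is, sold short).

Introducing Lagrange multipliers $\lambda$ and $\mu$ for the two constraints leads easily to the $n + 2$ linear equations

$$
\begin{array}{l}
\displaystyle\sum_{j=1}^{n} \sigma_{ij} w_j + \lambda \bar{r}_i + \mu = 0 \quad \text{for } i = 1, 2, \ldots, n \\
\displaystyle\sum_{i=1}^{n} w_i \bar{r}_i = \bar{r} \\
\displaystyle\sum_{i=1}^{n} w_i = 1
\end{array}
$$

in the $n + 2$ unknowns (the $w_i$'s, $\lambda$ and $\mu$). Here the factor of 2 that arises from differentiating the quadratic objective has been absorbed into the multipliers.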
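The linear system of the portfolio example can be assembled and solved directly. In the sketch below the covariance matrix, mean returns and target return are made-up illustrative data, and `solve_linear_system` is a hypothetical helper (Gaussian elimination with partial pivoting); the nonnegativity restriction on the weights is not imposed.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    aug = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


# Illustrative data (assumptions, not taken from the text).
sigma = [[0.04, 0.01, 0.00],
         [0.01, 0.09, 0.02],
         [0.00, 0.02, 0.16]]
mean_returns = [0.05, 0.08, 0.12]
target_return = 0.08
n = len(mean_returns)

# Rows 1..n: sum_j sigma_ij w_j + lambda * rbar_i + mu = 0.
system = [sigma[i] + [mean_returns[i], 1.0] for i in range(n)]
system.append(mean_returns + [0.0, 0.0])    # sum_i w_i rbar_i = target
system.append([1.0] * n + [0.0, 0.0])       # sum_i w_i = 1
rhs = [0.0] * n + [target_return, 1.0]

solution = solve_linear_system(system, rhs)
weights, lam, mu = solution[:n], solution[n], solution[n + 1]
variance = sum(weights[i] * sigma[i][j] * weights[j] for i in range(n) for j in range(n))
print(weights, lam, mu, variance)
```

If some computed weight were negative, the no-shorting restriction would be active and the problem would have to be treated with inequality constraints instead.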


Large-Scale  Applications 

The problems that serve as the primary motivation for the methods described in this part of the book are actually somewhat different in character from the problems represented by the above examples, which by necessity are quite simple. Larger, more complex, nonlinear programming problems arise frequently in modern applied analysis in a wide variety of disciplines. Indeed, within the past few decades nonlinear programming has advanced from a relatively young and primarily analytic subject to a substantial general tool for problem solving.

Large nonlinear programming problems arise in problems of mechanical structures, such as determining optimal configurations for bridges, trusses, and so forth. Some mechanical designs and configurations that in the past were found by solving differential equations are now often found by solving suitable optimization problems. An example that is somewhat similar to the hanging chain problem is the determination of the shape of a stiff cable suspended between two points and supporting a load.

A wide assortment of large-scale optimization problems arises in a similar way from methods for solving partial differential equations. In situations where the underlying continuous variables are defined over a two- or three-dimensional region, the continuous region is replaced by a grid consisting of perhaps several thousand discrete points. The corresponding discrete approximation to the partial differential equation is then solved indirectly by formulating an equivalent optimization problem. This approach is used in studies of plasticity, in heat equations, in the flow of fluids, in atomic physics, and indeed in almost all branches of physical science.

Problems of optimal control lead to large-scale nonlinear programming problems. In these problems a dynamic system, often described by an ordinary differential equation, relates control variables to a trajectory of the system state. This differential equation, or a discretized version of it, defines one set of constraints. The problem is to select the control variables so that the resulting trajectory satisfies various additional constraints and minimizes some criterion. An early example of such a problem that was solved numerically was the determination of the trajectory of a rocket to the moon that required the minimum fuel consumption.

There are many examples of nonlinear programming in industrial operations and business decision making. Many of these are nonlinear versions of the kinds of examples that were discussed in the linear programming part of the book. Nonlinearities can arise in production functions, cost curves, and, in fact, in almost all facets of problem formulation.

Portfolio analysis, in the context of both stock market investment and evaluation of a complex project within a firm, is an area where nonlinear programming is becoming increasingly useful. These problems can easily have thousands of variables.

In many areas of model building and analysis, optimization formulations are increasingly replacing the direct formulation of systems of equations. Thus large economic forecasting models often determine equilibrium prices by minimizing an objective termed consumer surplus. Physical models are often formulated as minimization of energy. Decision problems are formulated as maximizing expected utility. Data analysis procedures are based on minimizing an average error or maximizing a probability. As the methodology for solution of nonlinear programming improves, one can expect that this trend will continue.


11.5  Second-Order  Conditions

By an argument analogous to that used for the unconstrained case, we can also derive the corresponding second-order conditions for constrained problems. Throughout this section it is assumed that $f, h \in C^2$.

Second-Order Necessary Conditions. Suppose that $x^*$ is a local minimum of $f$ subject to $h(x) = 0$ and that $x^*$ is a regular point of these constraints. Then there is a $\lambda \in E^m$ such that

$$\nabla f(x^*) + \lambda^T \nabla h(x^*) = 0. \tag{11.18}$$

If we denote by $M$ the tangent plane $M = \{ y : \nabla h(x^*) y = 0 \}$, then the matrix

$$L(x^*) = F(x^*) + \lambda^T H(x^*) \tag{11.19}$$

is positive semidefinite on $M$, that is, $y^T L(x^*) y \ge 0$ for all $y \in M$.

Proof. From elementary calculus it is clear that for every twice differentiable curve on the constraint surface $S$ through $x^*$ (with $x(0) = x^*$) we have

$$\left. \frac{d^2}{dt^2} f(x(t)) \right|_{t=0} \ge 0. \tag{11.20}$$

By definition

$$\left. \frac{d^2}{dt^2} f(x(t)) \right|_{t=0} = \dot{x}(0)^T F(x^*) \dot{x}(0) + \nabla f(x^*) \ddot{x}(0). \tag{11.21}$$

Furthermore, differentiating the relation $\lambda^T h(x(t)) = 0$ twice, we obtain

$$\dot{x}(0)^T \lambda^T H(x^*) \dot{x}(0) + \lambda^T \nabla h(x^*) \ddot{x}(0) = 0. \tag{11.22}$$

Adding (11.22) to (11.21), while taking account of (11.20), yields the result

$$\left. \frac{d^2}{dt^2} f(x(t)) \right|_{t=0} = \dot{x}(0)^T L(x^*) \dot{x}(0) \ge 0,$$

where the terms in $\ddot{x}(0)$ cancel by (11.18). Since $\dot{x}(0)$ is arbitrary in $M$, we immediately have the stated conclusion. ∎

The above theorem is our first encounter with the matrix $L = F + \lambda^T H$ which is the matrix of second partial derivatives, with respect to $x$, of the Lagrangian $l$. (See Appendix A, Sect. A.6, for a discussion of the notation $\lambda^T H$ used here.) This matrix is the backbone of the theory of algorithms for constrained problems, and it is encountered often in subsequent chapters.

We next state the corresponding set of sufficient conditions.

Second-Order Sufficiency Conditions. Suppose there is a point $x^*$ satisfying $h(x^*) = 0$, and a $\lambda \in E^m$ such that

$$\nabla f(x^*) + \lambda^T \nabla h(x^*) = 0. \tag{11.23}$$

Suppose also that the matrix $L(x^*) = F(x^*) + \lambda^T H(x^*)$ is positive definite on $M = \{ y : \nabla h(x^*) y = 0 \}$, that is, for $y \in M$, $y \ne 0$ there holds $y^T L(x^*) y > 0$. Then $x^*$ is a strict local minimum of $f$ subject to $h(x) = 0$.


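Proof. If $x^*$ is not a strict relative minimum point, there exists a sequence of feasible points $\{ y_k \}$ converging to $x^*$ such that for each $k$, $f(y_k) \le f(x^*)$. Write each $y_k$ in the form $y_k = x^* + \delta_k s_k$ where $s_k \in E^n$, $|s_k| = 1$, and $\delta_k > 0$ for each $k$. Clearly, $\delta_k \to 0$ and the sequence $\{ s_k \}$, being bounded, must have a convergent subsequence converging to some $s^*$. For convenience of notation, we assume that the sequence $\{ s_k \}$ is itself convergent to $s^*$. We also have $h(y_k) - h(x^*) = 0$, and dividing by $\delta_k$ and letting $k \to \infty$ we see that $\nabla h(x^*) s^* = 0$.

Now by Taylor's theorem, we have for each $j$

$$0 = h_j(y_k) = h_j(x^*) + \delta_k \nabla h_j(x^*) s_k + \frac{\delta_k^2}{2} s_k^T \nabla^2 h_j(\eta_j) s_k \tag{11.24}$$

and

$$0 \ge f(y_k) - f(x^*) = \delta_k \nabla f(x^*) s_k + \frac{\delta_k^2}{2} s_k^T \nabla^2 f(\eta_0) s_k, \tag{11.25}$$

where each $\eta_j$ is a point on the line segment joining $x^*$ and $y_k$. Multiplying (11.24) by $\lambda_j$ and adding these to (11.25) we obtain, on accounting for (11.23),

$$0 \ge \frac{\delta_k^2}{2} s_k^T \left\{ \nabla^2 f(\eta_0) + \sum_{i=1}^{m} \lambda_i \nabla^2 h_i(\eta_i) \right\} s_k,$$

which yields a contradiction as $k \to \infty$: dividing by $\delta_k^2 / 2$ and passing to the limit gives $s^{*T} L(x^*) s^* \le 0$ with $s^* \in M$ and $|s^*| = 1$, contrary to the positive definiteness of $L(x^*)$ on $M$. ∎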


Example 1. Consider the problem

$$
\begin{array}{ll}
\text{maximize} & x_1 x_2 + x_2 x_3 + x_1 x_3 \\
\text{subject to} & x_1 + x_2 + x_3 = 3.
\end{array}
$$

In Example 1 of Sect. 11.4 it was found that $x_1 = x_2 = x_3 = 1$, $\lambda = -2$ satisfy the first-order conditions. The matrix $F + \lambda^T H$ becomes in this case

$$L = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix},$$

which itself is neither positive nor negative definite. On the subspace $M = \{ y : y_1 + y_2 + y_3 = 0 \}$, however, we note that

$$
\begin{aligned}
y^T L y &= y_1 (y_2 + y_3) + y_2 (y_1 + y_3) + y_3 (y_1 + y_2) \\
&= -(y_1^2 + y_2^2 + y_3^2),
\end{aligned}
$$

and thus $L$ is negative definite on $M$. Therefore, the solution we found is at least a local maximum.


11.6  Eigenvalues  in  Tangent  Subspace

In the last section it was shown that the matrix $L$ restricted to the subspace $M$ that is tangent to the constraint surface plays a role in second-order conditions entirely analogous to that of the Hessian of the objective function in the unconstrained case. It is perhaps not surprising, in view of this, that the structure of $L$ restricted to $M$ also determines rates of convergence of algorithms designed for constrained problems in the same way that the structure of the Hessian of the objective function does for unconstrained algorithms. Indeed, we shall see that the eigenvalues of $L$ restricted to $M$ determine the natural rates of convergence for algorithms designed for constrained problems. It is important, therefore, to understand what these restricted eigenvalues represent. We first determine geometrically what we mean by the restriction of $L$ to $M$ which we denote by $L_M$. Next we define the eigenvalues of the operator $L_M$. Finally we indicate how these various quantities can be computed.

Given any vector $y \in M$, the vector $Ly$ is in $E^n$ but not necessarily in $M$. We project $Ly$ orthogonally back onto $M$, as shown in Fig. 11.4, and the result is said to be the restriction of $L$ to $M$ operating on $y$. In this way we obtain a linear transformation from $M$ to $M$. The transformation is determined somewhat implicitly, however, since we do not have an explicit matrix representation.

A vector $y \in M$ is an eigenvector of $L_M$ if there is a real number $\lambda$ such that $L_M y = \lambda y$; the corresponding $\lambda$ is an eigenvalue of $L_M$. This coincides with the standard definition. In terms of $L$ we see that $y$ is an eigenvector of $L_M$ if $Ly$ can be written as the sum of $\lambda y$ and a vector orthogonal to $M$. See Fig. 11.5.

To obtain a matrix representation for $L_M$ it is necessary to introduce a basis in the subspace $M$. For simplicity it is best to introduce an orthonormal basis, say $e_1, e_2, \ldots, e_{n-m}$. Define the matrix $E$ to be the $n \times (n - m)$ matrix whose columns consist of the vectors $e_i$. Then any vector $y$ in $M$ can be written as $y = Ez$ for some $z \in E^{n-m}$ and, of course, $LEz$ represents the action of $L$ on such a vector. To project this result back into $M$ and express the result in terms of the basis $e_1, e_2, \ldots, e_{n-m}$, we merely multiply by $E^T$. Thus $E^T L E z$ is the vector whose components give the representation in terms of the basis; and, correspondingly, the $(n - m) \times (n - m)$ matrix $E^T L E$ is the matrix representation of $L$ restricted to $M$.

Fig. 11.5 Eigenvector of $L_M$

The eigenvalues of $L$ restricted to $M$ can be found by determining the eigenvalues of $E^T L E$. These eigenvalues are independent of the particular orthonormal basis $E$.


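Example 1. In the last section we considered

$$L = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$$

restricted to $M = \{ y : y_1 + y_2 + y_3 = 0 \}$. To obtain an explicit matrix representation on $M$ let us introduce the orthonormal basis:

$$
\begin{aligned}
e_1 &= \frac{1}{\sqrt{2}} (1, 0, -1) \\
e_2 &= \frac{1}{\sqrt{6}} (1, -2, 1).
\end{aligned}
$$

This gives, upon expansion,

$$E^T L E = \begin{bmatrix} -1 & 0 \\ 0 & -1 \end{bmatrix},$$

and hence $L$ restricted to $M$ acts like the negative of the identity.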
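The expansion of $E^T L E$ can be reproduced with a few lines of code. This is only an illustrative sketch: matrices are stored as lists of rows, and the helpers `matmul` and `transpose` are written here for the purpose rather than taken from any library.

```python
import math


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


L = [[0.0, 1.0, 1.0],
     [1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0]]

e1 = [1 / math.sqrt(2), 0.0, -1 / math.sqrt(2)]
e2 = [1 / math.sqrt(6), -2 / math.sqrt(6), 1 / math.sqrt(6)]
E = transpose([e1, e2])              # 3 x 2 matrix whose columns are e1 and e2

restricted = matmul(transpose(E), matmul(L, E))
print(restricted)
```

Up to rounding, the result is the negative of the $2 \times 2$ identity, so both eigenvalues of $L_M$ equal $-1$.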
Example 2. Let us consider the problem

$$
\begin{array}{ll}
\text{extremize} & x_1 + x_2^2 + x_2 x_3 + 2 x_3^2 \\
\text{subject to} & \tfrac{1}{2} (x_1^2 + x_2^2 + x_3^2) = \tfrac{1}{2}.
\end{array}
$$

The first-order necessary conditions are

$$
\begin{array}{l}
1 + \lambda x_1 = 0 \\
2 x_2 + x_3 + \lambda x_2 = 0 \\
x_2 + 4 x_3 + \lambda x_3 = 0.
\end{array}
$$

One solution to this set is easily seen to be $x_1 = 1$, $x_2 = 0$, $x_3 = 0$, $\lambda = -1$. Let us examine the second-order conditions at this solution point. The Lagrangian matrix there is

$$L = \begin{bmatrix} -1 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 3 \end{bmatrix},$$

and the corresponding subspace $M$ is

$$M = \{ y : y_1 = 0 \}.$$

In this case $M$ is the subspace spanned by the second two basis vectors in $E^3$ and hence the restriction of $L$ to $M$ can be found by taking the corresponding submatrix of $L$. Thus, in this case,

$$E^T L E = \begin{bmatrix} 1 & 1 \\ 1 & 3 \end{bmatrix}.$$

The characteristic polynomial of this matrix is

$$\det \begin{bmatrix} 1 - \lambda & 1 \\ 1 & 3 - \lambda \end{bmatrix} = (1 - \lambda)(3 - \lambda) - 1 = \lambda^2 - 4\lambda + 2.$$

The eigenvalues of $L_M$ are thus $\lambda = 2 \pm \sqrt{2}$, and $L_M$ is positive definite.

Since the matrix $L_M$ is positive definite, we conclude that the point found is a relative minimum point. This example illustrates that, in general, the restriction of $L$ to $M$ can be thought of as a submatrix of $L$, although it can be read directly from the original matrix only if the subspace $M$ is spanned by a subset of the original basis vectors.

Projected  Hessians 

The  above  approach  for  determining  the  eigenvalues  of  L  projected  onto  M  is  quite 
direct  and  relatively  simple.  There  is  another  approach,  however,  that  is  useful  in 
some  theoretical  arguments  and  convenient  for  simple  applications.  It  is  based  on 
constructing  matrices  and  determinants  of  order  n  rather  than  n  -  m,  but  there  is 
no need to find the orthonormal basis $E$. For simplicity, let $A = \nabla h$, which has full
row rank.
Any $x$ satisfying $Ax = 0$ can be expressed as

\[
x = (I - A^T (A A^T)^{-1} A) z
\]

for some $z$ (and the converse is also true), where $P_A = I - A^T (A A^T)^{-1} A$ is the
so-called projection matrix onto the null space of $A$ (that is, $M$). If $x^T L x \ge 0$ for all $x$ in
this null space, then $z^T P_A L P_A z \ge 0$ for all $z \in E^n$, or the $n$-dimensional symmetric
matrix $P_A L P_A$ is positive semidefinite. Furthermore, if $P_A L P_A$ has rank $n - m$, then
$L_M$ is positive definite, which yields the following test.

Projected Hessian Test. The matrix $L$ is positive definite on the subspace $M = \{x : \nabla h\, x = 0\}$
if and only if the projected Hessian matrix to the null space of $\nabla h$ is positive semidefinite
and has rank $n - m$.


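Example 3. Approaching Example 2 in this way and noting $A = \nabla h = (1, 0, 0)$ we
have

\[
P_A = I - \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}
= \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.
\]

Then

\[
P_A L P_A = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 3 \end{bmatrix},
\]

which is clearly positive semidefinite and has rank 2.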
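
The projected Hessian of Example 3 can be formed mechanically. The sketch below is restricted, as an assumption made to keep it in plain Python, to a single constraint ($m = 1$), in which case $(AA^T)^{-1}$ is a scalar reciprocal. The function names `matmul` and `projector_single_constraint` are ours.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def projector_single_constraint(a):
    """P = I - a^T (a a^T)^{-1} a for one constraint row a (m = 1 only)."""
    n = len(a)
    aa = sum(v * v for v in a)  # the scalar a a^T
    return [[(1.0 if i == j else 0.0) - a[i] * a[j] / aa
             for j in range(n)] for i in range(n)]

# Lagrangian matrix and constraint gradient of Example 2 at the solution.
L = [[-1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0],
     [0.0, 1.0, 3.0]]
A = [1.0, 0.0, 0.0]

P = projector_single_constraint(A)
PLP = matmul(matmul(P, L), P)
for row in PLP:
    print(row)
```

The nonzero block of the result is the matrix $E^T L E$ of Example 2, whose determinant is 2, so the projected Hessian has rank $n - m = 2$.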


11.7  Sensitivity 

The Lagrange multipliers associated with a constrained minimization problem have
an interpretation as prices, similar to the prices associated with constraints in linear
programming. In the nonlinear case the multipliers are associated with the particular
solution point and correspond to incremental or marginal prices, that is, prices
associated with small variations in the constraint requirements.

Suppose the problem

\[
\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{subject to} \quad & h(x) = 0
\end{aligned}
\tag{11.26}
\]

has a solution at the point $x^*$ which is a regular point of the constraints. Let $\lambda$ be the
corresponding Lagrange multiplier vector. Now consider the family of problems

\[
\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{subject to} \quad & h(x) = c,
\end{aligned}
\tag{11.27}
\]

where $c \in E^m$. For a sufficiently small range of $c$ near the zero vector, the problem
will have a solution point $x(c)$ near $x(0) = x^*$. For each of these solutions there is a
corresponding value $f(x(c))$, and this value can be regarded as a function of $c$, the
right-hand side of the constraints. The components of the gradient of this function
can be interpreted as the incremental rate of change in value per unit change in
the constraint requirements. Thus, they are the incremental prices of the constraint
requirements measured in units of the objective. We show below how these prices
are related to the Lagrange multipliers of the problem having $c = 0$.

Sensitivity Theorem. Let $f, h \in C^2$ and consider the family of problems

\[
\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{subject to} \quad & h(x) = c.
\end{aligned}
\tag{11.29}
\]

Suppose for $c = 0$ there is a local solution $x^*$ that is a regular point and that, together with
its associated Lagrange multiplier vector $\lambda$, satisfies the second-order sufficiency conditions
for a strict local minimum. Then for every $c \in E^m$ in a region containing $0$ there is an $x(c)$,
depending continuously on $c$, such that $x(0) = x^*$ and such that $x(c)$ is a local minimum
of (11.27). Furthermore,

\[
\nabla_c f(x(c))\big]_{c=0} = -\lambda^T.
\]

Proof. Consider the system of equations

\[
\nabla f(x) + \lambda^T \nabla h(x) = 0 \tag{11.30}
\]

\[
h(x) = c. \tag{11.31}
\]

By hypothesis, there is a solution $x^*$, $\lambda$ to this system when $c = 0$. The Jacobian
matrix of the system at this solution is

\[
\begin{bmatrix} L(x^*) & \nabla h(x^*)^T \\ \nabla h(x^*) & 0 \end{bmatrix}.
\]

Because by assumption $x^*$ is a regular point and $L(x^*)$ is positive definite on $M$,
it follows that this matrix is nonsingular (see Exercise 11). Thus, by the Implicit
Function Theorem, there is a solution $x(c)$, $\lambda(c)$ to the system which is in fact
continuously differentiable.

By the chain rule we have

\[
\nabla_c f(x(c))\big]_{c=0} = \nabla_x f(x^*) \nabla_c x(0)
\]

and

\[
\nabla_c h(x(c))\big]_{c=0} = \nabla_x h(x^*) \nabla_c x(0).
\]

In view of (11.31), the second of these is equal to the identity $I$ on $E^m$, while this, in
view of (11.30), implies that the first can be written

\[
\nabla_c f(x(c))\big]_{c=0} = -\lambda^T. \qquad \blacksquare
\]
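
The conclusion of the Sensitivity Theorem can be checked numerically on a small problem. The sketch below uses an illustrative problem that is our own assumption and does not appear in the text: minimize $x_1^2 + x_2^2$ subject to $x_1 + x_2 - 2 = c$. By symmetry the solution is $x_1 = x_2 = (2 + c)/2$, and the first-order condition $2x_1 + \lambda = 0$ gives $\lambda = -2$ at $c = 0$. The code compares a central-difference estimate of the derivative of the optimal value with $-\lambda$.

```python
def solve(c):
    """Minimizer of x1^2 + x2^2 subject to x1 + x2 - 2 = c (by symmetry)."""
    t = (2.0 + c) / 2.0
    return (t, t)

def f(x):
    return x[0] ** 2 + x[1] ** 2

x_star = solve(0.0)
lam = -2.0 * x_star[0]  # from the stationarity condition 2*x1 + lam = 0

# Central-difference estimate of d f(x(c)) / dc at c = 0.
eps = 1e-6
slope = (f(solve(eps)) - f(solve(-eps))) / (2.0 * eps)
print(slope, -lam)
```

The two printed numbers agree to within the accuracy of the finite difference, as the theorem predicts.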


11.8 Inequality Constraints

We consider now problems of the form

\[
\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{subject to} \quad & h(x) = 0, \quad g(x) \le 0.
\end{aligned}
\tag{11.32}
\]

We assume that $f$ and $h$ are as before and that $g$ is a $p$-dimensional function. Initially,
we assume $f, h, g \in C^1$.

There are a number of distinct theories concerning this problem, based on various
regularity conditions or constraint qualifications, which are directed toward obtaining
definitive general statements of necessary and sufficient conditions. One can by
no means pretend that all such results can be obtained as minor extensions of the
theory for problems having equality constraints only. To date, however, these
alternative results concerning necessary conditions have been of isolated theoretical
interest only, for they have not had an influence on the development of algorithms,
and have not contributed to the theory of algorithms. Their use has been limited to
small-scale programming problems of two or three variables. We therefore choose to
emphasize the simplicity of incorporating inequalities rather than the possible
complexities, not only for ease of presentation and insight, but also because it is
this viewpoint that forms the basis for work beyond that of obtaining necessary
conditions.


First-Order  Necessary  Conditions 

With  the  following  generalization  of  our  previous  definition  it  is  possible  to  parallel 
the  development  of  necessary  conditions  for  equality  constraints. 

Definition. Let $x^*$ be a point satisfying the constraints

\[
h(x^*) = 0, \quad g(x^*) \le 0, \tag{11.33}
\]

and let $J$ be the set of indices $j$ for which $g_j(x^*) = 0$. Then $x^*$ is said to be a regular point
of the constraints (11.33) if the gradient vectors $\nabla h_i(x^*)$, $\nabla g_j(x^*)$, $1 \le i \le m$, $j \in J$ are
linearly independent.

We note that, following the definition of active constraints given in Sect. 11.1, a
point $x^*$ is a regular point if the gradients of the active constraints are linearly
independent. Or, equivalently, $x^*$ is regular for the constraints if it is regular in the sense
of the earlier definition for equality constraints applied to the active constraints.

Karush-Kuhn-Tucker Conditions. Let $x^*$ be a relative minimum point for the problem

\[
\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{subject to} \quad & h(x) = 0, \quad g(x) \le 0,
\end{aligned}
\tag{11.34}
\]




and suppose $x^*$ is a regular point for the constraints. Then there is a vector $\lambda \in E^m$ and a
vector $\mu \in E^p$ with $\mu \ge 0$ such that

\[
\nabla f(x^*) + \lambda^T \nabla h(x^*) + \mu^T \nabla g(x^*) = 0 \tag{11.35}
\]

\[
\mu^T g(x^*) = 0. \tag{11.36}
\]


Proof. We note first, since $\mu \ge 0$ and $g(x^*) \le 0$, (11.36) is equivalent to the
statement that a component of $\mu$ may be nonzero only if the corresponding constraint
is active. This is a complementary slackness condition, stating that $g(x^*)_i < 0$ implies
$\mu_i = 0$ and $\mu_i > 0$ implies $g(x^*)_i = 0$.

Since $x^*$ is a relative minimum point over the constraint set, it is also a relative
minimum over the subset of that set defined by setting the active constraints to zero.
Thus, for the resulting equality constrained problem defined in a neighborhood of
$x^*$, there are Lagrange multipliers. Therefore, we conclude that (11.35) holds with
$\mu_j = 0$ if $g_j(x^*) \ne 0$ (and hence (11.36) also holds).

It remains to be shown that $\mu \ge 0$. Suppose $\mu_k < 0$ for some $k \in J$. Let $S$ and $M$
be the surface and tangent plane, respectively, defined by all other active constraints
at $x^*$. By the regularity assumption, there is a $y$ such that $y \in M$ and $\nabla g_k(x^*) y < 0$.
Let $x(t)$ be a curve on $S$ passing through $x^*$ (at $t = 0$) with $\dot{x}(0) = y$. Then for small
$t \ge 0$, $x(t)$ is feasible, and

\[
\frac{d}{dt} f(x(t)) \Big|_{t=0} = \nabla f(x^*) y < 0
\]

by (11.35), which contradicts the minimality of $x^*$. ∎


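Example. Consider the problem

\[
\begin{aligned}
\text{minimize} \quad & 2x_1^2 + 2x_1 x_2 + x_2^2 - 10x_1 - 10x_2 \\
\text{subject to} \quad & x_1^2 + x_2^2 \le 5 \\
& 3x_1 + x_2 \le 6.
\end{aligned}
\]

The first-order necessary conditions, in addition to the constraints, are

\[
\begin{aligned}
4x_1 + 2x_2 - 10 + 2\mu_1 x_1 + 3\mu_2 &= 0 \\
2x_1 + 2x_2 - 10 + 2\mu_1 x_2 + \mu_2 &= 0 \\
\mu_1 \ge 0, \quad \mu_2 &\ge 0 \\
\mu_1 (x_1^2 + x_2^2 - 5) &= 0 \\
\mu_2 (3x_1 + x_2 - 6) &= 0.
\end{aligned}
\]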

To find a solution we define various combinations of active constraints and check
the signs of the resulting Lagrange multipliers. In this problem we can try setting
none, one, or two constraints active. Assuming the first constraint is active and the
second is inactive yields the equations

\[
\begin{aligned}
4x_1 + 2x_2 - 10 + 2\mu_1 x_1 &= 0 \\
2x_1 + 2x_2 - 10 + 2\mu_1 x_2 &= 0 \\
x_1^2 + x_2^2 &= 5,
\end{aligned}
\]

which have the solution

\[
x_1 = 1, \quad x_2 = 2, \quad \mu_1 = 1.
\]

This yields $3x_1 + x_2 = 5$ and hence the second constraint is satisfied. Thus, since
$\mu_1 > 0$, we conclude that this solution satisfies the first-order necessary conditions.
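
The trial-and-check procedure of this example is easy to mechanize. The sketch below evaluates the first-order conditions of the example at a candidate point and candidate multipliers; the function name `kkt_report` and the tolerance `1e-12` are our own choices. It is applied to the point found above and, for contrast, to the point $(0, 5)$ obtained when no constraint is assumed active.

```python
def kkt_report(x1, x2, mu1, mu2, tol=1e-12):
    """Evaluate the first-order conditions of the example at a candidate."""
    stationarity = (4 * x1 + 2 * x2 - 10 + 2 * mu1 * x1 + 3 * mu2,
                    2 * x1 + 2 * x2 - 10 + 2 * mu1 * x2 + mu2)
    g = (x1 ** 2 + x2 ** 2 - 5, 3 * x1 + x2 - 6)  # both must be <= 0
    slackness = (mu1 * g[0], mu2 * g[1])          # both must be 0
    ok = (all(abs(s) <= tol for s in stationarity)
          and all(v <= tol for v in g)
          and mu1 >= 0 and mu2 >= 0
          and all(abs(s) <= tol for s in slackness))
    return stationarity, g, slackness, ok

# First constraint active, second inactive: the solution found in the text.
print(kkt_report(1, 2, 1, 0))

# No constraint active: the stationary point of the objective alone is (0, 5),
# which violates the first constraint and is therefore rejected.
print(kkt_report(0, 5, 0, 0))
```

The first candidate passes every test. The second is stationary for the objective but infeasible, so that combination of active constraints is discarded.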


Second-Order  Conditions 

The second-order conditions, both necessary and sufficient, for problems with
inequality constraints, are derived essentially by consideration only of the equality
constrained problem that is implied by the active constraints. The appropriate
tangent plane for these problems is the plane tangent to the active constraints.

Second-Order Necessary Conditions. Suppose the functions $f, g, h \in C^2$ and that $x^*$ is a
regular point of the constraints (11.33). If $x^*$ is a relative minimum point for problem (11.32),
then there is a $\lambda \in E^m$, $\mu \in E^p$, $\mu \ge 0$ such that (11.35) and (11.36) hold and such that

\[
L(x^*) = F(x^*) + \lambda^T H(x^*) + \mu^T G(x^*) \tag{11.37}
\]

is positive semidefinite on the tangent subspace of the active constraints at $x^*$.

Proof. If $x^*$ is a relative minimum point over the constraints (11.33), it is also a
relative minimum point for the problem with the active constraints taken as equality
constraints. ∎

Just  as  in  the  theory  of  unconstrained  minimization,  it  is  possible  to  formulate  a 
converse  to  the  Second-Order  Necessary  Condition  Theorem  and  thereby  obtain  a 
Second-Order  Sufficiency  Condition  Theorem.  By  analogy  with  the  unconstrained 
situation,  one  can  guess  that  the  required  hypothesis  is  that  L(x*)  be  positive  definite 
on  the  tangent  plane  M.  This  is  indeed  sufficient  in  most  situations.  However,  if  there 
are  degenerate  inequality  constraints  (that  is,  active  inequality  constraints  having 
zero  as  associated  Lagrange  multiplier),  we  must  require  L(x*)  to  be  positive  definite 
on  a  subspace  that  is  larger  than  M. 

Second-Order Sufficiency Conditions. Let $f, g, h \in C^2$. Sufficient conditions that a point
$x^*$ satisfying (11.33) be a strict relative minimum point of problem (11.32) are that there exist
$\lambda \in E^m$, $\mu \in E^p$, such that

\[
\mu \ge 0 \tag{11.38}
\]

\[
\mu^T g(x^*) = 0 \tag{11.39}
\]

\[
\nabla f(x^*) + \lambda^T \nabla h(x^*) + \mu^T \nabla g(x^*) = 0, \tag{11.40}
\]

and the Hessian matrix

\[
L(x^*) = F(x^*) + \lambda^T H(x^*) + \mu^T G(x^*) \tag{11.41}
\]

is positive definite on the subspace

\[
M' = \{ y : \nabla h(x^*) y = 0, \ \nabla g_j(x^*) y = 0 \text{ for all } j \in J \},
\]

where $J = \{ j : g_j(x^*) = 0, \ \mu_j > 0 \}$.

Proof. As in the proof of the corresponding theorem for equality constraints in
Sect. 11.5, assume that $x^*$ is not a strict relative minimum point; let $\{y_k\}$ be a
sequence of feasible points converging to $x^*$ such that $f(y_k) \le f(x^*)$, and write each
$y_k$ in the form $y_k = x^* + \delta_k s_k$ with $|s_k| = 1$, $\delta_k > 0$. We may assume that $\delta_k \to 0$ and
$s_k \to s^*$. We have $0 \ge \nabla f(x^*) s^*$, and for each $i = 1, \ldots, m$ we have

\[
\nabla h_i(x^*) s^* = 0.
\]

Also for each active constraint $g_j$ we have $g_j(y_k) - g_j(x^*) \le 0$, and hence

\[
\nabla g_j(x^*) s^* \le 0.
\]

If $\nabla g_j(x^*) s^* = 0$ for all $j \in J$, then the proof goes through just as in Sect. 11.5. If
$\nabla g_j(x^*) s^* < 0$ for at least one $j \in J$, then

\[
0 \ge \nabla f(x^*) s^* = -\lambda^T \nabla h(x^*) s^* - \mu^T \nabla g(x^*) s^* > 0,
\]

which is a contradiction. ∎

We note in particular that if all active inequality constraints have strictly positive
corresponding Lagrange multipliers (no degenerate inequalities), then the set $J$
includes all of the active inequalities. In this case the sufficient condition is that the
Lagrangian be positive definite on $M$, the tangent plane of active constraints.
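
As an illustration, the sufficiency condition can be checked for the example of the first-order subsection at $x^* = (1, 2)$ with $\mu_1 = 1$, $\mu_2 = 0$. The active constraint there is nondegenerate, so $M' = M$. The sketch below forms $L = F + \mu_1 G_1$ and evaluates the quadratic form along the single direction that spans the tangent subspace. The variable names are ours.

```python
F = [[4.0, 2.0], [2.0, 2.0]]    # Hessian of the objective
G1 = [[2.0, 0.0], [0.0, 2.0]]   # Hessian of g1(x) = x1^2 + x2^2 - 5
mu1 = 1.0
L = [[F[i][j] + mu1 * G1[i][j] for j in range(2)] for i in range(2)]

# Gradient of the active constraint g1 at x* = (1, 2).
grad_g1 = (2.0 * 1.0, 2.0 * 2.0)

# In the plane, {y : grad_g1 . y = 0} is spanned by the rotated gradient.
y = (-grad_g1[1], grad_g1[0])

Ly = (L[0][0] * y[0] + L[0][1] * y[1],
      L[1][0] * y[0] + L[1][1] * y[1])
# Rayleigh quotient y^T L y / y^T y along the tangent direction.
curvature = (y[0] * Ly[0] + y[1] * Ly[1]) / (y[0] ** 2 + y[1] ** 2)
print(L, curvature)
```

The curvature along the tangent direction is positive, so the point satisfies the second-order sufficiency conditions and is a strict relative minimum point.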


Sensitivity 

The sensitivity result for problems with inequalities is a simple restatement of the
result for equalities. In this case, a nondegeneracy assumption is introduced so that
the small variations produced in Lagrange multipliers when the constraints are
varied will not violate the positivity requirement.

Sensitivity Theorem. Let $f, g, h \in C^2$ and consider the family of problems

\[
\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{subject to} \quad & h(x) = c, \quad g(x) \le d.
\end{aligned}
\tag{11.42}
\]

Suppose that for $c = 0$, $d = 0$, there is a local solution $x^*$ that is a regular point and
that, together with the associated Lagrange multipliers $\lambda$, $\mu \ge 0$, satisfies the second-order
sufficiency conditions for a strict local minimum. Assume further that no active inequality
constraint is degenerate. Then for every $(c, d) \in E^{m+p}$ in a region containing $(0, 0)$ there is
a solution $x(c, d)$, depending continuously on $(c, d)$, such that $x(0, 0) = x^*$, and such that
$x(c, d)$ is a relative minimum point of (11.42). Furthermore,

\[
\nabla_c f(x(c, d))\big]_{0,0} = -\lambda^T \tag{11.43}
\]

\[
\nabla_d f(x(c, d))\big]_{0,0} = -\mu^T. \tag{11.44}
\]
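
Relation (11.44) can be illustrated on the example of the first-order subsection, where $\mu_1 = 1$ at $x^* = (1, 2)$. The sketch below perturbs the first constraint to $x_1^2 + x_2^2 \le 5 + d$, assumes that this constraint stays active for small $d$ (which is what nondegeneracy suggests), and minimizes the objective along the circle by a ternary search on the angle. The search bracket of 0.8 to 1.4 radians is an assumption chosen by hand to contain the angle of $x^*$.

```python
import math

def f(x1, x2):
    return 2 * x1 ** 2 + 2 * x1 * x2 + x2 ** 2 - 10 * x1 - 10 * x2

def min_on_circle(radius_sq, lo=0.8, hi=1.4, iters=200):
    """Minimize f on x1^2 + x2^2 = radius_sq by ternary search on the angle.

    Assumes f is unimodal in the angle on [lo, hi], which holds near x*.
    """
    r = math.sqrt(radius_sq)

    def phi(t):
        return f(r * math.cos(t), r * math.sin(t))

    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if phi(a) < phi(b):
            hi = b
        else:
            lo = a
    return phi(0.5 * (lo + hi))

mu1 = 1.0
d = 1e-3
# Central-difference estimate of the derivative of the optimal value in d.
slope = (min_on_circle(5.0 + d) - min_on_circle(5.0 - d)) / (2.0 * d)
print(min_on_circle(5.0), slope, -mu1)
```

The estimated rate of change of the optimal value with respect to $d$ is close to $-\mu_1$, so relaxing the first constraint by one unit lowers the objective by about one unit.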


11.9  Zero-Order  Conditions  and  Lagrangian  Relaxation 

Zero-order conditions for functionally constrained problems express conditions in
terms of Lagrange multipliers without the use of derivatives. This theory is not only
of great practical value, but it also gives new insight into the meaning of Lagrange
multipliers. Rather than regarding the Lagrange multipliers as separate scalars, they
are identified as components of a single vector that has a strong geometric
interpretation. As before, the basic constrained problem is

\[
\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{subject to} \quad & h(x) = 0, \quad g(x) \le 0 \\
& x \in \Omega,
\end{aligned}
\tag{11.45}
\]

where $x$ is a vector in $E^n$, and $h$ and $g$ are $m$-dimensional and $p$-dimensional
functions, respectively.

In purest form, zero-order conditions require that the functions that define the
objective and the constraints are convex functions and sets. (See Appendix B.)
The vector-valued function $g$ consisting of $p$ individual component functions
$g_1, g_2, \ldots, g_p$ is said to be convex if each of the component functions is convex.

The programming problem (11.45) above is termed a convex programming problem
if the functions $f$ and $g$ are convex, the function $h$ is affine (that is, linear plus a
constant and can be written as $Ax - b$), and the set $\Omega \subset E^n$ is convex.

Notice that according to the propositions of Sect. 7.4, the set defined by each of the
inequalities $g_j(x) \le 0$ is convex. This is true also of a set defined by $h_i(x) = 0$.
Since the overall constraint set is the intersection of these and $\Omega$ it follows from
Proposition 1 of Appendix B that this overall constraint set is itself convex. Hence
the problem can be regarded as minimize $f(x)$, $x \in \Omega_1$ where $\Omega_1$ is a convex
subset of $\Omega$.

With this view, one could apply the zero-order conditions of Sect. 7.6 to the problem
with constraint set $\Omega_1$. However, in the case of functional constraints it is common
to keep the structure of the constraints explicit instead of folding them into an
amorphous set.

Although it is possible to derive the zero-order conditions for (11.45) all at once,
treating both equality and inequality constraints together, it is notationally
cumbersome to do so and it may obscure the basic simplicity of the arguments. For this
reason, we treat equality constraints first, then inequality constraints, and finally the
combination of the two.

The equality problem is

\[
\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{subject to} \quad & h(x) = 0 \\
& x \in \Omega.
\end{aligned}
\tag{11.46}
\]

Letting $Y = E^m$, we have $h(x) \in Y$ for all $x$. For this problem we require a regularity
condition.

Definition. An affine function $h$ is regular with respect to $\Omega$ if the set $C$ in $Y$ defined by
$C = \{y : h(x) = y \text{ for some } x \in \Omega\}$ contains an open sphere around $0$; that is, $C$ contains a
set of the form $\{y : |y| < \varepsilon\}$ for some $\varepsilon > 0$.

This condition means that $h(x)$ can attain $0$ and can vary in arbitrary directions
from $0$. Notice that this condition is similar to the definition of a regular point in
the context of first-order conditions. If $h$ has continuous derivatives at a point $x^*$
the earlier regularity condition implies that $\nabla h(x^*)$ is of full rank and the Implicit
Function Theorem (of Appendix A) then guarantees that there is an $\varepsilon > 0$ such that
for any $y$ with $|y - h(x^*)| < \varepsilon$ there is an $x$ such that $h(x) = y$. In other words, there
is an open sphere around $y^* = h(x^*)$ that is attainable. In the present situation we
assume this attainability directly, at the point $0 \in Y$.

Next  we  introduce  the  following  important  construction. 

Definition. The primal function associated with problem (11.46) is

\[
\omega(y) = \inf\{ f(x) : h(x) = y, \ x \in \Omega \},
\]

defined for all $y \in C$.

Notice that the primal function is defined by varying the right hand side of the
constraint. The original problem (11.46) corresponds to $\omega(0)$. The primal function
is illustrated in Fig. 11.6.

Proposition 1. Suppose $\Omega$ is convex, the function $f$ is convex, and $h$ is affine. Then the
primal function $\omega$ is convex.


Fig. 11.6 The primal function

Proof. For simplicity of notation we assume that $\Omega$ is the entire space $X$. Then we
observe

\[
\begin{aligned}
\omega(\alpha y_1 + (1 - \alpha) y_2) &= \inf\{ f(x) : h(x) = \alpha y_1 + (1 - \alpha) y_2 \} \\
&\le \inf\{ f(x) : x = \alpha x_1 + (1 - \alpha) x_2, \ h(x_1) = y_1, \ h(x_2) = y_2 \} \\
&\le \alpha \inf\{ f(x_1) : h(x_1) = y_1 \} + (1 - \alpha) \inf\{ f(x_2) : h(x_2) = y_2 \} \\
&= \alpha \omega(y_1) + (1 - \alpha) \omega(y_2). \qquad \blacksquare
\end{aligned}
\]
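
The convexity of the primal function can be observed numerically. The sketch below uses an illustrative problem that is our own assumption: $f(x) = \tfrac{1}{2}(x_1^2 + x_2^2)$, $h(x) = x_1 + x_2 - 2$, and $\Omega$ the entire plane. The constraint is eliminated by substitution and the remaining one-dimensional convex problem is solved by ternary search; the bracket $[-10, 10]$ is a hand-chosen assumption. The code then tests the convexity inequality on a few sample pairs.

```python
def omega(y, lo=-10.0, hi=10.0, iters=200):
    """Primal function inf{f(x) : h(x) = y} for the illustrative problem.

    The constraint x1 + x2 - 2 = y is eliminated by x2 = 2 + y - x1 and the
    resulting convex function of x1 is minimized by ternary search.
    """
    def phi(t):
        return 0.5 * (t ** 2 + (2.0 + y - t) ** 2)

    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if phi(a) < phi(b):
            hi = b
        else:
            lo = a
    return phi(0.5 * (lo + hi))

ys = [-1.0, -0.5, 0.0, 0.5, 1.0]
alpha = 0.3
# Check omega(a*y1 + (1-a)*y2) <= a*omega(y1) + (1-a)*omega(y2) on samples,
# with a small tolerance for rounding.
convex_ok = all(
    omega(alpha * y1 + (1 - alpha) * y2)
    <= alpha * omega(y1) + (1 - alpha) * omega(y2) + 1e-12
    for y1 in ys for y2 in ys
)
print(omega(0.0), convex_ok)
```

For this problem the primal function has the closed form $\omega(y) = (2 + y)^2 / 4$, which is convex, and the sampled inequality holds accordingly.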

We now turn to the derivation of the Lagrange multiplier result for (11.46).

Proposition 2. Assume that $\Omega \subset E^n$ is convex, $f$ is a convex function on $\Omega$ and $h$ is an
$m$-dimensional affine function on $\Omega$. Assume that $h$ is regular with respect to $\Omega$. If $x^*$
solves (11.46), then there is $\lambda \in E^m$ such that $x^*$ solves the Lagrangian relaxation
problem

\[
\begin{aligned}
\text{minimize} \quad & f(x) + \lambda^T h(x) \\
\text{subject to} \quad & x \in \Omega.
\end{aligned}
\]

Proof. Let $f^* = f(x^*)$. Define the sets $A$ and $B$ in $E^{m+1}$ as

\[
A = \{ (r, y) : r \ge \omega(y), \ y \in C \} \quad \text{and} \quad B = \{ (r, y) : r \le f^*, \ y = 0 \}.
\]

$A$ is the epigraph of $\omega$ (see Sect. 7.6) and $B$ is the vertical line extending below $f^*$
and aligned with the origin. Both $A$ and $B$ are convex sets. Their only common point
is at $(f^*, 0)$. See Fig. 11.7.


Fig.  11.7  The  sets  A  and  B  and  the  separating  hyperplane 


According to the separating hyperplane theorem (Appendix B), there is a hyperplane
separating $A$ and $B$. This hyperplane can be represented by a nonzero vector
in $E^{m+1}$ of the form $(s, \lambda)$, with $\lambda \in E^m$, and a separation constant $c$. The separation
conditions are

\[
s r + \lambda^T y \ge c \ \text{ for all } (r, y) \in A \quad \text{and} \quad s r + \lambda^T y \le c \ \text{ for all } (r, y) \in B.
\]

It follows immediately that $s \ge 0$ for otherwise points $(r, 0) \in B$ with $r$ very negative
would violate the second inequality.

Geometrically, if $s = 0$ the hyperplane would be vertical. We wish to show that
$s \ne 0$, and it is for this purpose that we make use of the regularity condition. Suppose
$s = 0$. Then $\lambda \ne 0$ since both $s$ and $\lambda$ cannot be zero. It follows from the second
separation inequality that $c = 0$ because the hyperplane must include the point
$(f^*, 0)$. Now, as $y$ ranges over a sphere centered at $0 \in C$, the left hand side of the
first separation inequality ranges correspondingly over $\lambda^T y$ which is negative for
some $y$'s. This contradicts the first separation inequality. Thus $s \ne 0$ and thus in fact
$s > 0$. Without loss of generality we may, by rescaling if necessary, assume that
$s = 1$.

Finally, suppose $x \in \Omega$. Then $(f(x), h(x)) \in A$ and $(f(x^*), 0) \in B$. Thus, from
the separation inequality (with $s = 1$) we have

\[
f(x) + \lambda^T h(x) \ge f(x^*) = f(x^*) + \lambda^T h(x^*).
\]

Hence $x^*$ solves the stated minimization problem. ∎

Example 1 (Best Rectangle). Consider the classic problem of finding the rectangle
of maximum area while limiting the perimeter to a length of 4. This can be
formulated as

\[
\begin{aligned}
\text{minimize} \quad & -x_1 x_2 \\
\text{subject to} \quad & x_1 + x_2 - 2 = 0 \\
& x_1 \ge 0, \quad x_2 \ge 0.
\end{aligned}
\]

The regularity condition is met because it is possible to make the right hand side of
the functional constraint slightly positive or slightly negative with nonnegative $x_1$
and $x_2$. We know the answer to the problem is $x_1 = x_2 = 1$. The Lagrange multiplier
is $\lambda = 1$. The Lagrangian problem of Proposition 2 is

\[
\begin{aligned}
\text{minimize} \quad & -x_1 x_2 + 1 \cdot (x_1 + x_2 - 2) \\
\text{subject to} \quad & x_1 \ge 0, \quad x_2 \ge 0.
\end{aligned}
\]

This can be solved by differentiation to obtain $x_1 = x_2 = 1$.

However the conclusion of the proposition is not satisfied! The value of the
Lagrangian at the solution is $V = -1 + 1 + 1 - 2 = -1$. However, at $x_1 = x_2 = 0$
the value of the Lagrangian is $V' = -2$ which is less than $V$. The Lagrangian is
not minimized at the solution. The proposition breaks down because the objective
function $f(x_1, x_2) = -x_1 x_2$ is not convex.

Example 2 (Best Diagonal). As an alternative problem, consider minimizing the
length of the diagonal of a rectangle subject to the perimeter being of length 4.
This problem can be formulated as

\[
\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2}(x_1^2 + x_2^2) \\
\text{subject to} \quad & x_1 + x_2 - 2 = 0 \\
& x_1 \ge 0, \quad x_2 \ge 0.
\end{aligned}
\]

In this case the objective function is convex. The solution is $x_1 = x_2 = 1$ and the
Lagrange multiplier is $\lambda = -1$. The Lagrangian problem is

\[
\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2}(x_1^2 + x_2^2) - 1 \cdot (x_1 + x_2 - 2) \\
\text{subject to} \quad & x_1 \ge 0, \quad x_2 \ge 0.
\end{aligned}
\]

The value of the Lagrangian at the solution is $V = 1$ which in this case is a minimum
as guaranteed by the proposition. (The value at $x_1 = x_2 = 0$ is $V' = 2$.)
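
The contrast between the two examples can be reproduced by evaluating both Lagrangians directly. The sketch below computes the values quoted above and, for the Best Diagonal problem, searches a coarse grid over $[0, 3] \times [0, 3]$ for the smallest Lagrangian value. The grid and the function names are our own choices; a grid search is only a spot check, not a proof of minimality.

```python
def lagrangian_rectangle(x1, x2, lam=1.0):
    """Example 1: f = -x1*x2, h = x1 + x2 - 2, multiplier lam = 1."""
    return -x1 * x2 + lam * (x1 + x2 - 2.0)

def lagrangian_diagonal(x1, x2, lam=-1.0):
    """Example 2: f = (x1^2 + x2^2)/2, h = x1 + x2 - 2, multiplier lam = -1."""
    return 0.5 * (x1 ** 2 + x2 ** 2) + lam * (x1 + x2 - 2.0)

# Sample the region x1 >= 0, x2 >= 0 on a coarse grid over [0, 3] x [0, 3].
grid = [0.25 * k for k in range(13)]
points = [(a, b) for a in grid for b in grid]

V_rect = lagrangian_rectangle(1.0, 1.0)
V_rect_origin = lagrangian_rectangle(0.0, 0.0)
V_diag = lagrangian_diagonal(1.0, 1.0)
V_diag_origin = lagrangian_diagonal(0.0, 0.0)
best_diag = min(points, key=lambda p: lagrangian_diagonal(*p))

print(V_rect, V_rect_origin)      # nonconvex case: the origin gives a lower value
print(V_diag, V_diag_origin, best_diag)
```

In the nonconvex case the Lagrangian takes a smaller value at the origin than at the constrained solution, while in the convex case the grid minimum falls at the constrained solution itself.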


Inequality  Constraints 

We outline the parallel results for the inequality constrained problem

\[
\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{subject to} \quad & g(x) \le 0 \\
& x \in \Omega,
\end{aligned}
\tag{11.47}
\]

where $g$ is a $p$-dimensional function.

We let $Z = E^p$ and define $D \subset Z$ as $D = \{ z \in Z : g(x) \le z \text{ for some } x \in \Omega \}$.
The regularity condition (called the Slater condition) is that there is a $z_1 \in D$ with
$z_1 < 0$.

As  before  we  introduce  the  primal  function. 

Definition. The primal function associated with problem (11.47) is

\[
\omega(z) = \inf\{ f(x) : g(x) \le z, \ x \in \Omega \}.
\]

The primal function is again defined by varying the right hand side of the constraint
function, using the variable $z$. Now the primal function is monotonically decreasing
with $z$, since an increase in $z$ enlarges the constraint region.

Proposition 3. Suppose $\Omega \subset E^n$ is convex and $f$ and $g$ are convex functions. Then the
primal function $\omega$ is also convex.

Proof. The proof parallels that of Proposition 1. One simply substitutes $g(x) \le z$ for
$h(x) = y$ throughout the series of inequalities. ∎

The zero-order necessary Lagrangian conditions are then given by the proposition
below.

Proposition 4. Assume $\Omega$ is a convex subset of $E^n$ and that $f$ and $g$ are convex functions.
Assume also that there is a point $x_1 \in \Omega$ such that $g(x_1) < 0$. Then, if $x^*$ solves (11.47),
there is a vector $\mu \in E^p$ with $\mu \ge 0$ such that $x^*$ solves the Lagrangian relaxation problem

\[
\begin{aligned}
\text{minimize} \quad & f(x) + \mu^T g(x) \\
\text{subject to} \quad & x \in \Omega.
\end{aligned}
\tag{11.48}
\]

Furthermore, $\mu^T g(x^*) = 0$.

Proof. Here is the proof outline. Let $f^* = f(x^*)$. In this case define in $E^{p+1}$ the
two sets

\[
A = \{ (r, z) : r \ge f(x), \ z \ge g(x), \text{ for some } x \in \Omega \} \quad \text{and} \quad B = \{ (r, z) : r \le f^*, \ z \le 0 \}.
\]

$A$ is the epigraph of the primal function $\omega$. The set $B$ is the rectangular region at or
to the left of the vertical axis and at or lower than $f^*$. Both $A$ and $B$ are convex. See
Fig. 11.8.


Fig. 11.8 The sets A and B and the separating hyperplane for inequalities


The proof is made by constructing a hyperplane separating $A$ and $B$. The
regularity condition guarantees that this hyperplane is not vertical. ∎

The condition $\mu^T g(x^*) = 0$ is the complementary slackness condition that is
characteristic of necessary conditions for problems with inequality constraints.

Example 4 (Quadratic Program). Consider the quadratic program

\[
\begin{aligned}
\text{minimize} \quad & x^T Q x + c^T x \\
\text{subject to} \quad & a^T x \le b \\
& x \ge 0.
\end{aligned}
\]

Let $\Omega = \{x : x \ge 0\}$ and $g(x) = a^T x - b$. Assume that the $n \times n$ matrix $Q$ is positive
definite, in which case the objective function is convex. Assuming that $b > 0$, the
Slater regularity condition is satisfied. Hence there is a Lagrange multiplier $\mu \ge 0$
(a scalar in this case) such that the solution $x^*$ to the quadratic program is also a
solution to

\[
\begin{aligned}
\text{minimize} \quad & x^T Q x + c^T x + \mu (a^T x - b) \\
\text{subject to} \quad & x \ge 0
\end{aligned}
\]

and $\mu (a^T x^* - b) = 0$.


Mixed Constraints

The two previous results can be combined to obtain zero-order conditions for the
problem

\[
\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{subject to} \quad & h(x) = 0, \quad g(x) \le 0 \\
& x \in \Omega.
\end{aligned}
\tag{11.49}
\]

Zero-order Lagrange Theorem. Assume that $\Omega \subset E^n$ is a convex set, $f$ and $g$ are convex
functions of dimension 1 and $p$, respectively, and $h$ is affine of dimension $m$. Assume also
that $h$ satisfies the regularity condition with respect to $\Omega$ and that there is an $x_1 \in \Omega$ with
$h(x_1) = 0$ and $g(x_1) < 0$. Suppose $x^*$ solves (11.49). Then there are vectors $\lambda \in E^m$ and
$\mu \in E^p$ with $\mu \ge 0$ such that $x^*$ solves the Lagrangian relaxation problem

\[
\begin{aligned}
\text{minimize} \quad & f(x) + \lambda^T h(x) + \mu^T g(x) \\
\text{subject to} \quad & x \in \Omega.
\end{aligned}
\tag{11.50}
\]

Furthermore, $\mu^T g(x^*) = 0$.

The  convexity  requirements  of  this  result  are  satisfied  in  many  practical  problems. 
Indeed  convex  programming  problems  are  both  pervasive  and  relatively  well  treated 
by  theory  and  numerical  methods.  The  corresponding  theory  also  motivates  many 
approaches  to  general  nonlinear  programming  problems.  In  fact,  it  will  be  apparent 
that  many  methods  attempt  to  “convexify”  a  general  nonlinear  problem  either  by 
changing  the  formulation  of  the  underlying  application  or  by  introducing  devices 
that  temporarily  relax  as  the  method  progresses. 


Zero-Order  Sufficient  Conditions 

The  sufficiency  conditions  are  very  strong  and  do  not  require  convexity. 

Proposition 5 (Sufficiency Conditions). Suppose $f$ is a real-valued function on a set
$\Omega \subset E^n$. Suppose also that $h$ and $g$ are, respectively, $m$-dimensional and $p$-dimensional
functions on $\Omega$. Finally, suppose there are vectors $x^* \in \Omega$, $\lambda \in E^m$, and $\mu \in E^p$ with
$\mu \ge 0$ such that

\[
f(x^*) + \lambda^T h(x^*) + \mu^T g(x^*) \le f(x) + \lambda^T h(x) + \mu^T g(x)
\]

for all $x \in \Omega$. Then $x^*$ solves

\[
\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{subject to} \quad & h(x) = h(x^*), \quad g(x) \le g(x^*) \\
& x \in \Omega.
\end{aligned}
\]

Proof. Suppose there is $x_1 \in \Omega$ with $f(x_1) < f(x^*)$, $h(x_1) = h(x^*)$, and $g(x_1) \le g(x^*)$.
From $\mu \ge 0$ it is clear that $\mu^T g(x_1) \le \mu^T g(x^*)$. It follows that
$f(x_1) + \lambda^T h(x_1) + \mu^T g(x_1) < f(x^*) + \lambda^T h(x^*) + \mu^T g(x^*)$, which is a contradiction. ∎

Notice that the constraint of the Lagrangian relaxation problem is significantly
simpler, and the problem is typically much easier to solve for given $\lambda$ and $\mu$. The result
suggests that Lagrange multiplier values might be guessed and used to define an initial
Lagrangian relaxation problem which is subsequently minimized. This will produce
a solution $x$ and its constraint values. If these values meet the given right-hand
side requirement, then $x$ is optimal. Otherwise, one may adjust Lagrange multiplier
values accordingly. Indeed, this approach, the Lagrangian relaxation method, will
be characteristic of a duality method treated in Chap. 14.
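
The guess-and-adjust idea can be sketched on the Best Diagonal problem of Example 2. For a given $\lambda$ the relaxed problem separates by coordinate and has the closed-form minimizer $x_i = \max(0, -\lambda)$. The multiplier update used below, a fixed step proportional to the constraint violation, is an assumption made for illustration and is not taken from the text; the step size 0.25 is chosen by hand for this example.

```python
def relaxed_minimizer(lam):
    """Minimize (x1^2 + x2^2)/2 + lam*(x1 + x2 - 2) over x1 >= 0, x2 >= 0.

    Each coordinate minimizes x^2/2 + lam*x over x >= 0, giving max(0, -lam).
    """
    t = max(0.0, -lam)
    return (t, t)

def h(x):
    """Constraint value; the original problem requires h(x) = 0."""
    return x[0] + x[1] - 2.0

lam = 0.0     # initial guess for the multiplier
step = 0.25   # fixed step size (hand-chosen assumption)
for _ in range(60):
    x = relaxed_minimizer(lam)
    # Raise lam when h(x) > 0 and lower it when h(x) < 0.
    lam += step * h(x)

x = relaxed_minimizer(lam)
print(lam, x, h(x))
```

The iteration settles at $\lambda = -1$ and $x = (1, 1)$, where the constraint value meets the required right-hand side, so by Proposition 5 the point is optimal for the original problem.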

The theory of this section has an inherent geometric simplicity captured clearly
by Figs. 11.7 and 11.8. It raises one's level of understanding of Lagrange multipliers
and sets the stage for the theory of convex duality presented in Chap. 14. It is
certainly possible to jump ahead and read that now.


11.10  Summary 

Given  a  minimization  problem  subject  to  equality  constraints  in  which  all  functions 
are  smooth,  a  necessary  condition  satisfied  at  a  minimum  point  is  that  the  gradient 
of  the  objective  function  is  orthogonal  to  the  tangent  plane  of  the  constraint  surface. 
If  the  point  is  regular,  then  the  tangent  plane  has  a  simple  representation  in  terms  of 
the  gradients  of  the  constraint  functions,  and  the  above  condition  can  be  expressed 
in  terms  of  Lagrange  multipliers. 

If the functions have continuous second partial derivatives and Lagrange multipliers
exist, then the Hessian of the Lagrangian restricted to the tangent plane plays
a  role  in  second-order  conditions  analogous  to  that  played  by  the  Hessian  of  the 
objective  function  in  unconstrained  problems.  Specifically,  the  restricted  Hessian 
must  be  positive  semidefinite  at  a  relative  minimum  point  and,  conversely,  if  it  is 
positive  definite  at  a  point  satisfying  the  first-order  conditions,  that  point  is  a  strict 
local  minimum  point. 

Inequalities  are  treated  by  determining  which  of  them  are  active  at  a  solution. 
An active inequality then acts just like an equality, except that its associated Lagrange
multiplier can never be negative because of the sensitivity interpretation of
the  multipliers. 

The necessary conditions for convex problems can be expressed without derivatives,
and these are accordingly termed zero-order conditions. These conditions are
highly geometric in character and explicitly treat the Lagrange multiplier as a vector
in a space having dimension equal to that of the right-hand side of the constraints.
This Lagrange multiplier vector defines a hyperplane that separates the epigraph of
the primal function from a set of unattainable objective and constraint value combinations.

The “zero-order” optimality condition developed in this chapter establishes a
theoretical basis for the Lagrangian relaxation method, which is introduced
later and is extremely popular for large-scale optimization.


11.11 Exercises

1. In $E^2$ consider the constraints

$$x_1 \ge 0, \quad x_2 \ge 0$$

$$x_2 - (x_1 - 1)^2 \le 0.$$

Show that the point $x_1 = 1$, $x_2 = 0$ is feasible but is not a regular point.

2.  Find  the  rectangle  of  given  perimeter  that  has  greatest  area  by  solving  the  first- 
order  necessary  conditions.  Verify  that  the  second-order  sufficiency  conditions 
are  satisfied. 

3. Verify the second-order conditions for the entropy example of Sect. 11.4.

4. A cardboard box for packing quantities of small foam balls is to be manufactured
as shown in Fig. 11.9. The top, bottom, and front faces must be of double
weight (i.e., two pieces of cardboard). A problem posed is to find the dimensions
of such a box that maximize the volume for a given amount of cardboard,
equal to 72 sq. ft.

(a)  What  are  the  first-order  necessary  conditions? 

(b)  Find  x,  y,  z. 

(c)  Verify  the  second-order  conditions. 


Fig.  11.9  Packing  box 


5.  Define 


L  = 


4  3 
3  1 
2  1 


$$h = (1, 1, 0),$$

and let $M$ be the subspace consisting of those points $x = (x_1, x_2, x_3)$ satisfying
$h^T x = 0$.

(a) Find $L_M$.

(b) Find the eigenvalues of $L_M$.



(c) Find

$$p(\lambda) = \det \begin{bmatrix} 0 & h^T \\ -h & L - I\lambda \end{bmatrix}.$$


(d)  Apply  the  projected  Hessian  test. 


6. Show that $z^T x = 0$ for all $x$ satisfying $Ax = 0$ if and only if $z = A^T w$ for
some $w$. (Hint: Use the Duality Theorem of Linear Programming.)

7. After a heavy military campaign a certain army requires many new shoes. The
quartermaster can order three sizes of shoes. Although he does not know precisely
how many of each size are required, he feels that the demands for the three
sizes are independent and that the demand for each size is uniformly distributed
between zero and three thousand pairs. He wishes to allocate his shoe budget of
$4,000 among the three sizes so as to maximize the expected number of men
properly shod. Small shoes cost $1 per pair, medium shoes cost $2 per pair, and
large shoes cost $4 per pair. How many pairs of each size should he order?

8. Optimal control. A one-dimensional dynamic process is governed by a difference
equation

$$x(k+1) = \phi(x(k), u(k), k)$$

with initial condition $x(0) = x_0$. In this equation the value $x(k)$ is called the state
at step $k$ and $u(k)$ is the control at step $k$. Associated with this system there is an
objective function of the form

$$J = \sum_{k=0}^{N} \psi(x(k), u(k), k).$$

In addition, there is a terminal constraint of the form

$$g(x(N+1)) = 0.$$

The problem is to find the sequence of controls $u(0), u(1), u(2), \ldots, u(N)$
and corresponding state values to minimize the objective function while satisfying
the terminal constraint. Assuming all functions have continuous first partial
derivatives and that the regularity condition is satisfied, show that associated
with an optimal solution there is a sequence $\lambda(k)$, $k = 0, 1, \ldots, N$ and a $\mu$ such
that

$$\lambda(k-1) = \lambda(k)\phi_x(x(k), u(k), k) + \psi_x(x(k), u(k), k), \quad k = 1, 2, \ldots, N$$

$$\lambda(N) = \mu g_x(x(N+1))$$

$$\psi_u(x(k), u(k), k) + \lambda(k)\phi_u(x(k), u(k), k) = 0, \quad k = 0, 1, 2, \ldots, N.$$

9. Generalize Exercise 8 to include the case where the state $x(k)$ is an $n$-dimensional
vector and the control $u(k)$ is an $m$-dimensional vector at each stage $k$.

10. An egocentric young man has just inherited a fortune $F$ and is now planning
how to spend it so as to maximize his total lifetime enjoyment. He deduces
that if $x(k)$ denotes his capital at the beginning of year $k$, his holdings will be
approximately governed by the difference equation

$$x(k+1) = \alpha x(k) - u(k), \quad x(0) = F,$$

where $\alpha > 1$ (with $\alpha - 1$ as the interest rate of investment) and where $u(k)$ is the
amount spent in year $k$. He decides that the enjoyment achieved in year $k$ can
be expressed as $\psi(u(k))$ where $\psi$, his utility function, is a smooth function, and
that his total lifetime enjoyment is

$$J = \sum_{k=0}^{N} \psi(u(k))\beta^k,$$

where the term $\beta^k$ $(0 < \beta < 1)$ reflects the notion that future enjoyment is
counted less today. The young man wishes to determine the sequence of expenditures
that will maximize his total enjoyment subject to the condition
$x(N+1) = 0$.

(a) Find the general optimality relationship for this problem.

(b) Find the solution for the special case $\psi(u) = u^{1/2}$.

11. Let $A$ be an $m \times n$ matrix of rank $m$ and let $L$ be an $n \times n$ matrix that is symmetric
and positive definite on the subspace $M = \{x : Ax = 0\}$. Show that the
$(n+m) \times (n+m)$ matrix

$$\begin{bmatrix} L & A^T \\ A & 0 \end{bmatrix}$$

is nonsingular.

12. Consider the quadratic program

$$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2} x^T Q x - b^T x \\ \text{subject to} \quad & Ax = c. \end{aligned}$$

Prove that $x^*$ is a local minimum point if and only if it is a global minimum
point. (No convexity is assumed.)

13. Maximize $14x - x^2 + 6y - y^2 + 7$ subject to $x + y \le 2$, $x + 2y \le 3$.

14. In the quadratic program example of Sect. 11.9, what are more general conditions
on $a$ and $b$ that satisfy the Slater condition?

15. What are the general zero-order Lagrangian conditions for the problem (11.46)
without the regularity condition? [The coefficient of $f$ will be zero, so there is
no real condition.]

16. Show that the problem of finding the rectangle of maximum area with a diagonal
of unit length can be formulated as an unconstrained convex programming
problem using trigonometric functions. [Hint: use variable $\theta$ over the range
$0 \le \theta \le 45^\circ$.]


Chapter 12

Primal  Methods 


In this chapter we initiate the presentation, analysis, and comparison of algorithms
designed to solve constrained minimization problems. The four chapters that consider
such problems roughly correspond to the following classification scheme.
Consider a constrained minimization problem having $n$ variables and $m$ constraints.
Methods can be devised for solving this problem that work in spaces of dimension
$n - m$, $n$, $m$, or $n + m$. Each of the following chapters corresponds to methods in
one of these spaces. Thus, the methods in the different chapters represent quite different
approaches and are founded on different aspects of the theory. However, there
are also strong interconnections between the methods of the various chapters, both
in the final form of implementation and in their performance. Indeed, there soon
emerges the theme that the rates of convergence of most practical algorithms are
determined by the structure of the Hessian of the Lagrangian, much as the structure
of the Hessian of the objective function determines the rates of convergence
for a wide assortment of methods for unconstrained problems. Thus, although the
various algorithms of these chapters differ substantially in their motivation, they are
ultimately found to be governed by a common set of principles.


12.1  Advantage  of  Primal  Methods 

We consider the question of solving the general nonlinear programming problem

$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & h(x) = 0, \quad g(x) \le 0 \end{aligned} \tag{12.1}$$

where $x$ is of dimension $n$, while $f$, $g$, and $h$ have dimensions 1, $p$, and $m$, respectively.
It is assumed throughout the chapter that all of the functions have continuous
partial derivatives of order three. Geometrically, we regard the problem as that of
minimizing $f$ over the region in $E^n$ defined by the constraints.

By a primal method of solution we mean a search method that works on the
original problem directly by searching through the feasible region for the optimal
solution. Each point in the process is feasible and the value of the objective function
constantly decreases. For a problem with $n$ variables and having $m$ equality
constraints only, primal methods work in the feasible space, which has dimension
$n - m$.

Primal  methods  possess  three  significant  advantages  that  recommend  their  use 
as  general  procedures  applicable  to  almost  all  nonlinear  programming  problems. 
First,  since  each  point  generated  in  the  search  procedure  is  feasible,  if  the  process 
is  terminated  before  reaching  the  solution  (as  practicality  almost  always  dictates 
for  nonlinear  problems),  the  terminating  point  is  feasible.  Thus  this  final  point  is  a 
feasible  and  probably  nearly  optimal  solution  to  the  original  problem  and  therefore 
may  represent  an  acceptable  solution  to  the  practical  problem  that  motivated  the 
nonlinear  program.  A  second  attractive  feature  of  primal  methods  is  that,  often,  it 
can  be  guaranteed  that  if  they  generate  a  convergent  sequence,  the  limit  point  of  that 
sequence  must  be  at  least  a  local  constrained  minimum.  Finally,  a  major  advantage 
is that most primal methods do not rely on special problem structure, such as convexity,
and hence these methods are applicable to general nonlinear programming
problems. 

Primal  methods  are  not,  however,  without  major  disadvantages.  They  require  a 
phase  I  procedure  (see  Sect.  3.5)  to  obtain  an  initial  feasible  point,  and  they  are  all 
plagued,  particularly  for  problems  with  nonlinear  constraints,  with  computational 
difficulties  arising  from  the  necessity  to  remain  within  the  feasible  region  as  the 
method  progresses.  Some  methods  can  fail  to  converge  for  problems  with  inequality 
constraints  unless  elaborate  precautions  are  taken. 

The  convergence  rates  of  primal  methods  are  competitive  with  those  of  other 
methods, and particularly for linear constraints, they are often among the most efficient.
On balance their general applicability and simplicity place these methods in a
role  of  central  importance  among  nonlinear  programming  algorithms. 


12.2  Feasible  Direction  Methods 

The idea of feasible direction methods is to take steps through the feasible region of
the form

$$x_{k+1} = x_k + \alpha_k d_k, \tag{12.2}$$

where $d_k$ is a direction vector and $\alpha_k$ is a nonnegative scalar. The scalar is chosen
to minimize the objective function $f$ with the restriction that the point $x_{k+1}$ and
the line segment joining $x_k$ and $x_{k+1}$ be feasible. Thus, in order that the process
of minimizing with respect to $\alpha$ be nontrivial, an initial segment of the ray
$x_k + \alpha d_k$, $\alpha > 0$ must be contained in the feasible region. This motivates the use of
feasible directions for the directions of search. We recall from Sect. 7.1 that a vector
$d_k$ is a feasible direction (at $x_k$) if there is an $\bar{\alpha} > 0$ such that $x_k + \alpha d_k$ is feasible
for all $\alpha$, $0 \le \alpha \le \bar{\alpha}$. A feasible direction method can be considered as a natural
extension of our unconstrained descent methods. Each step is the composition of
selecting a feasible direction and a constrained line search.

Let us consider the problem with linear inequality constraints

$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & a_i^T x \le b_i, \quad i = 1, \ldots, m. \end{aligned} \tag{12.3}$$


Example 1 (Frank-Wolfe Method). One of the earliest proposals for a feasible direction
method uses a sequential linear programming subproblem approach. Given a
feasible point $x_k$, the direction vector is

$$d_k = x_k^* - x_k$$

where $x_k^*$ is a solution to the linear program

$$\begin{aligned} \text{minimize} \quad & \nabla f(x_k) x \\ \text{subject to} \quad & a_i^T x \le b_i, \quad i = 1, \ldots, m. \end{aligned}$$
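
A minimal sketch of the Frank-Wolfe iteration follows, under several assumptions that are not from the text. The feasible region is the triangle $x_1 \ge 0$, $x_2 \ge 0$, $x_1 + x_2 \le 1$, and the linear program is solved by enumerating its three vertices, which works only because the region is a small bounded polytope with known vertices. The objective $f(x) = (x_1 - 1)^2 + (x_2 - 1)^2$ is a made-up quadratic, so the line search has a closed form, and the step is restricted to $0 \le \alpha \le 1$.

```python
# Frank-Wolfe sketch on an assumed toy problem (not from the text):
#   minimize (x1 - 1)^2 + (x2 - 1)^2  over the triangle with the vertices below.

VERTICES = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

def grad_f(x):
    return [2.0 * (x[0] - 1.0), 2.0 * (x[1] - 1.0)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def frank_wolfe(x, tol=1e-12, max_iter=100):
    for _ in range(max_iter):
        g = grad_f(x)
        # Linear subproblem: a linear function on a polytope attains its
        # minimum at a vertex, so enumeration replaces an LP solver here.
        x_star = min(VERTICES, key=lambda v: dot(g, v))
        d = [s - xi for s, xi in zip(x_star, x)]
        slope = dot(g, d)
        if slope >= -tol:
            break  # no descent direction of this form remains
        # Exact line search for this quadratic (Hessian 2I), kept in [0, 1].
        alpha = min(1.0, -slope / (2.0 * dot(d, d)))
        x = [xi + alpha * di for xi, di in zip(x, d)]
    return x

x_fw = frank_wolfe([0.0, 0.0])
print(x_fw)
```

For a general constraint set $a_i^T x \le b_i$ the vertex enumeration would be replaced by a call to a linear programming solver.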

Example 2 (Simplified Zoutendijk Method). Another proposal solves a sequence of
linear subprograms as follows. Given a feasible point $x_k$, let $I$ be the set of indices
representing active constraints, that is, $a_i^T x_k = b_i$ for $i \in I$. The direction vector $d_k$
is then chosen as a solution to the linear program

$$\begin{aligned} \text{minimize} \quad & \nabla f(x_k) d \\ \text{subject to} \quad & a_i^T d \le 0, \quad i \in I \\ & \sum_{i=1}^{n} |d_i| = 1, \end{aligned} \tag{12.4}$$

where $d = (d_1, d_2, \ldots, d_n)$. The last equation is a normalizing equation that ensures
a bounded solution. (Even though it is written in terms of absolute values, the
problem can be converted to a linear program; see Exercise 1.) The other constraints
assure that vectors of the form $x_k + \alpha d_k$ will be feasible for sufficiently small $\alpha > 0$,
and subject to these conditions, $d$ is chosen to line up as closely as possible with
the negative gradient of $f$. In some sense this will result in the locally best direction
in which to proceed. The overall procedure progresses by generating feasible
directions in this manner, and moving along them to decrease the objective.

There are two major shortcomings of feasible direction methods that require that
they be modified in most cases. The first shortcoming is that for general problems
there may not exist any feasible directions. If, for example, a problem had nonlinear
equality constraints, we might find ourselves in the situation depicted by Fig. 12.1
where no straight line from $x_k$ has a feasible segment. For such problems it is necessary
either to relax our requirement of feasibility by allowing points to deviate
slightly from the constraint surface or to introduce the concept of moving along
curves rather than straight lines.


Fig. 12.1 No feasible direction


A  second  shortcoming  is  that  in  simplest  form  most  feasible  direction  methods, 
such as the simplified Zoutendijk method, are not globally convergent. They are subject
to jamming (sometimes referred to as zigzagging) where the sequence of points
generated  by  the  process  converges  to  a  point  that  is  not  even  a  constrained  local 
minimum  point.  This  phenomenon  can  be  explained  by  the  fact  that  the  algorithmic 
map  is  not  closed. 

It  is  possible  to  develop  feasible  direction  algorithms  that  are  closed  and  hence 
not  subject  to  jamming.  Some  procedures  for  doing  so  are  discussed  in  Exercises 
4-7. However, such methods can become somewhat complicated. A simpler approach
for treating inequality constraints is to use an active set method, as discussed
in  the  next  section. 


12.3  Active  Set  Methods 

The  idea  underlying  active  set  methods  is  to  partition  inequality  constraints  into 
two  groups:  those  that  are  to  be  treated  as  active  and  those  that  are  to  be  treated  as 
inactive.  The  constraints  treated  as  inactive  are  essentially  ignored. 

Consider the constrained problem

$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & g(x) \le 0, \end{aligned}$$

which for simplicity of the current discussion is taken to have inequality constraints
only. The inclusion of equality constraints is straightforward, as will become clear.
The necessary conditions for this problem are

$$\begin{aligned} \nabla f(x) + \lambda^T \nabla g(x) &= 0 \\ g(x) &\le 0 \\ \lambda^T g(x) &= 0 \\ \lambda &\ge 0. \end{aligned} \tag{12.6}$$

(See Sect. 11.8.) These conditions can be expressed in a somewhat simpler form in
terms of the set of active constraints. Let $A$ denote the index set of active constraints;
that is, $A$ is the set of $i$ such that $g_i(x^*) = 0$. Then the necessary conditions (12.6)
become

$$\begin{aligned} \nabla f(x) + \sum_{i \in A} \lambda_i \nabla g_i(x) &= 0 \\ g_i(x) &= 0, \quad i \in A \\ g_i(x) &< 0, \quad i \notin A \\ \lambda_i &\ge 0, \quad i \in A \\ \lambda_i &= 0, \quad i \notin A. \end{aligned} \tag{12.7}$$

The first two lines of these conditions correspond identically to the necessary conditions
of the equality constrained problem obtained by requiring the active constraints
to be zero. The next line guarantees that the inactive constraints are satisfied, and the
sign requirement of the Lagrange multipliers guarantees that every constraint that is
active should be active.

It is clear that if the active set were known, the original problem could be replaced
by the corresponding problem having equality constraints only. Alternatively, suppose
an active set was guessed and the corresponding equality constrained problem
solved. Then if the other constraints were satisfied and the Lagrange multipliers
turned out to be nonnegative, that solution would be correct.
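
The guess-and-check idea can be illustrated on a small made-up problem. The data below are assumptions for this sketch only: the objective $f(x) = |x - c|^2$ with $c = (2, 1)$, and two linear inequalities $x_1 + x_2 - 2 \le 0$ and $-x_1 \le 0$. To keep the algebra in closed form, the sketch only handles guessed active sets with at most one constraint.

```python
# Guessing an active set and checking the result (toy data, not from the text).

C = (2.0, 1.0)                       # f(x) = |x - C|^2
CONSTRAINTS = [((1.0, 1.0), 2.0),    # g_0(x) = x1 + x2 - 2 <= 0
               ((-1.0, 0.0), 0.0)]   # g_1(x) = -x1 <= 0

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def solve_for_guess(active):
    """Solve the equality constrained problem for a guessed active set
    holding at most one index; return (x, multipliers)."""
    if not active:
        return list(C), {}
    (i,) = active
    a, b = CONSTRAINTS[i]
    t = (dot(a, C) - b) / dot(a, a)
    x = [ci - t * ai for ci, ai in zip(C, a)]   # projection of C onto a.x = b
    return x, {i: 2.0 * t}                      # from 2(x - C) + lam * a = 0

def guess_is_correct(active, tol=1e-12):
    x, lam = solve_for_guess(active)
    # The constraints left out of the guess must still be satisfied ...
    others_ok = all(dot(a, x) - b <= tol
                    for i, (a, b) in enumerate(CONSTRAINTS) if i not in active)
    # ... and the multipliers of the guessed constraints must be nonnegative.
    signs_ok = all(v >= -tol for v in lam.values())
    return others_ok and signs_ok, x, lam

results = {guess: guess_is_correct(guess) for guess in [(), (0,), (1,)]}
for guess, (ok, x, lam) in results.items():
    print(guess, ok, x, lam)
```

Here the empty guess fails because the first constraint is violated at the unconstrained minimum, the guess containing only the second constraint fails because its multiplier is negative, and the guess containing only the first constraint passes both checks.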

The idea of active set methods is to define at each step, or at each phase, of
an algorithm a set of constraints, termed the working set, that is to be treated as
the active set. The working set is chosen to be a subset of the constraints that are
actually active at the current point, and hence the current point is feasible for the
working set. The algorithm then proceeds to move on the surface defined by the
working set of constraints to an improved point. At this new point the working
set may be changed. Overall, then, an active set method consists of the following
components: (1) determination of a current working set that is a subset of the current
active constraints, and (2) movement on the surface defined by the working set to an
improved point.

There are several methods for determining the movement on the surface defined
by the working set. (This surface will be called the working surface.) The most important
of these methods are discussed in the following sections. The direction of
movement is generally determined by first-order or second-order approximations
of the functions at the current point in a manner similar to that for unconstrained
problems. The asymptotic convergence properties of active set methods depend entirely
on the procedure for moving on the working surface, since near the solution
the working set is generally equal to the correct active set, and the process simply
moves successively on the surface determined by those constraints.


Changes  in  Working  Set 

Suppose that for a given working set $W$ the problem with equality constraints

$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & g_i(x) = 0, \quad i \in W \end{aligned}$$

is solved yielding the point $x_W$ that satisfies $g_i(x_W) < 0$, $i \notin W$. This point satisfies
the necessary conditions

$$\nabla f(x_W) + \sum_{i \in W} \lambda_i \nabla g_i(x_W) = 0. \tag{12.8}$$

If $\lambda_i \ge 0$ for all $i \in W$, then the point $x_W$ is a local solution to the original problem.
If, on the other hand, there is an $i \in W$ such that $\lambda_i < 0$, then the objective
can be decreased by relaxing constraint $i$. This follows directly from the sensitivity
interpretation of Lagrange multipliers, since a small decrease in the constraint
value from 0 to $-c$ would lead to a change in the objective function of $\lambda_i c$, which
is negative. Thus, by dropping the constraint $i$ from the working set, an improved
solution can be obtained. The Lagrange multiplier of a problem thereby serves as
an indication of which constraints should be dropped from the working set. This is
illustrated in Fig. 12.2. In the figure, $x$ is the minimum point of $f$ on the surface (a
curve in this case) defined by $g_1(x) = 0$. However, it is clear that the corresponding
Lagrange multiplier $\lambda_1$ is negative, implying that $g_1$ should be dropped. Since $\nabla f$
points outside, it is clear that a movement toward the interior of the feasible region
will indeed decrease $f$.

During the course of minimizing $f(x)$ over the working surface, it is necessary
to monitor the values of the other constraints to be sure that they are not violated,
since all points defined by the algorithm must be feasible. It often happens that
while moving on the working surface a new constraint boundary is encountered. It
is then convenient to add this constraint to the working set, proceeding on a surface
of one lower dimension than before. This is illustrated in Fig. 12.3. In the figure the
working constraint is just $g_1 = 0$ for $x_1$, $x_2$, $x_3$. A boundary is encountered at the
next step, and therefore $g_2 = 0$ is adjoined to the set of working constraints.


Fig.  12.2  Constraint  to  be  dropped 


A complete active set strategy for systematically dropping and adding constraints
can be developed by combining the above two ideas. One starts with a given working
set and begins minimizing over the corresponding working surface. If new constraint
boundaries are encountered, they may be added to the working set, but no constraints
are dropped from the working set. Finally, a point is obtained that minimizes $f$
with respect to the current working set of constraints. The corresponding Lagrange
multipliers are determined, and if they are all nonnegative the solution is optimal.
Otherwise, one or more constraints with negative Lagrange multipliers are dropped
from the working set. The procedure is reinitiated with this new working set, and $f$
will strictly decrease on the next step.


Fig. 12.3 Constraint added to working set

An active set method built upon this basic active set strategy requires that a procedure
be defined for minimization on a working surface that allows constraints to
be added to the working set when they are encountered, and that, after dropping a
constraint, ensures that the objective is strictly decreased. Such a method is guaranteed
to converge to the optimal solution, as shown below.

Active Set Theorem. Suppose that for every subset $W$ of the constraint indices, the constrained
problem

$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & g_i(x) = 0, \quad i \in W \end{aligned} \tag{12.9}$$

is well-defined with a unique nondegenerate solution (that is, for all $i \in W$, $\lambda_i \ne 0$). Then
the sequence of points generated by the basic active set strategy converges to the solution
of the inequality constrained problem (12.6).

Proof. After the solution corresponding to one working set is found, a decrease in
the objective is made, and hence it is not possible to return to that working set. Since
there are only a finite number of working sets, the process must terminate. ∎

The difficulty with the above procedure is that several problems with incorrect
active sets must be solved. Furthermore, the solutions to these intermediate problems
must, in general, be exact global minimum points in order to determine the
correct sign of the Lagrange multipliers and to assure that during the subsequent
descent process the current working surface is not encountered again.

In practice one deviates from the ideal basic method outlined above by dropping
constraints using various criteria before an exact minimum on the working surface
is found. Convergence cannot be guaranteed for many of these methods, and indeed
they are subject to zigzagging (or jamming) where the working set changes
an infinite number of times. However, experience has shown that zigzagging is very
rare for many algorithms, and in practice the active set strategy with various refinements
is often very effective.

It  is  clear  that  a  fundamental  component  of  an  active  set  method  is  the  algorithm 
for  solving  a  problem  with  equality  constraints  only,  that  is,  for  minimizing  on  the 
working  surface.  Such  methods  and  their  analyses  are  presented  in  the  following 
sections. 


12.4  The  Gradient  Projection  Method 

The  gradient  projection  method  is  motivated  by  the  ordinary  method  of  steepest 
descent  for  unconstrained  problems.  The  negative  gradient  is  projected  onto  the 
working  surface  in  order  to  define  the  direction  of  movement.  We  present  it  here  in 
a  simplified  form  that  is  based  on  a  pure  active  set  strategy. 


Linear  Constraints 

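Consider first problems of the form

$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & a_i^T x \le b_i, \quad i \in I_1 \\ & a_i^T x = b_i, \quad i \in I_2 \end{aligned} \tag{12.10}$$

having linear equalities and inequalities.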

A feasible solution to the constraints, if one exists, can be found by application
of the phase I procedure of linear programming; so we shall always assume that
our descent process is initiated at such a feasible point. At a given feasible point $x$
there will be a certain number $q$ of active constraints satisfying $a_i^T x = b_i$ and some
inactive constraints $a_i^T x < b_i$. We initially take the working set $W(x)$ to be the set of
active constraints.

At the feasible point $x$ we seek a feasible direction vector $d$ satisfying $\nabla f(x) d < 0$,
so that movement in the direction $d$ will cause a decrease in the function $f$. Initially,
we consider directions satisfying $a_i^T d = 0$, $i \in W(x)$ so that all working
constraints remain active. This requirement amounts to requiring that the direction
vector $d$ lie in the tangent subspace $M$ defined by the working set of constraints. The
particular direction vector that we shall use is the projection of the negative gradient
onto this subspace.

To compute this projection let $A_q$ be defined as composed of the rows of working
constraints. Assuming regularity of the constraints, as we shall always assume, $A_q$
will be a $q \times n$ matrix of rank $q < n$. The tangent subspace $M$ in which $d$ must
lie is the subspace of vectors satisfying $A_q d = 0$. This means that the subspace $N$
consisting of the vectors making up the rows of $A_q$ (that is, all vectors of the form
$A_q^T \lambda$ for $\lambda \in E^q$) is orthogonal to $M$. Indeed, any vector can be written as the sum of
vectors from each of these two complementary subspaces. In particular, the negative
gradient vector $-g_k$ can be written

$$-g_k = d_k + A_q^T \lambda_k \tag{12.11}$$

where $d_k \in M$ and $\lambda_k \in E^q$. We may solve for $\lambda_k$ through the requirement that
$A_q d_k = 0$. Thus

$$A_q d_k = -A_q g_k - (A_q A_q^T)\lambda_k = 0, \tag{12.12}$$

which leads to

$$\lambda_k = -(A_q A_q^T)^{-1} A_q g_k \tag{12.13}$$

and

$$d_k = -[I - A_q^T (A_q A_q^T)^{-1} A_q] g_k = -P_k g_k. \tag{12.14}$$

The matrix

$$P_k = [I - A_q^T (A_q A_q^T)^{-1} A_q] \tag{12.15}$$

is called the projection matrix corresponding to the subspace $M$. Action by it on any
vector yields the projection of that vector onto $M$. See Exercises 8 and 9 for other
derivations of this result.

We easily check that if $d_k \ne 0$, then it is a direction of descent. Since $g_k + d_k$ is
orthogonal to $d_k$, we have

$$g_k^T d_k = (g_k^T + d_k^T - d_k^T) d_k = -|d_k|^2.$$

Thus if $d_k$ as computed from (12.14) turns out to be nonzero, it is a feasible direction
of descent on the working surface.
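
The projection formulas (12.13) and (12.14) and the descent identity above can be checked numerically. The sketch below uses plain Python lists and a small Gaussian elimination helper so that it has no dependencies; the matrix `A_q` and the gradient `g_k` are made-up values for illustration, and the function names are hypothetical.

```python
# Projected negative gradient via (12.13)-(12.14), with illustrative data.

def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z

def project_negative_gradient(A, g):
    """Return (d, lam) with -g = d + A^T lam and A d = 0."""
    q, n = len(A), len(g)
    AAt = [[sum(A[i][k] * A[j][k] for k in range(n)) for j in range(q)]
           for i in range(q)]
    Ag = [sum(A[i][k] * g[k] for k in range(n)) for i in range(q)]
    lam = solve(AAt, [-v for v in Ag])                    # (12.13)
    d = [-g[k] - sum(A[i][k] * lam[i] for i in range(q))  # (12.11) solved for d
         for k in range(n)]
    return d, lam

A_q = [[1.0, 1.0, 1.0],
       [1.0, 0.0, -1.0]]       # two working constraints in E^3 (assumed)
g_k = [1.0, 2.0, 6.0]          # gradient at the current point (assumed)

d_k, lam_k = project_negative_gradient(A_q, g_k)
slope = sum(gi * di for gi, di in zip(g_k, d_k))
norm_sq = sum(di * di for di in d_k)
print(d_k, lam_k, slope, -norm_sq)
```

With these numbers the direction is $d_k = (-0.5, 1, -0.5)$, which satisfies $A_q d_k = 0$, and the slope $g_k^T d_k$ equals $-|d_k|^2 = -1.5$. Solving the small system for $\lambda_k$ avoids forming the inverse in (12.15) explicitly, which is a design choice of this sketch rather than something the text prescribes.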

We next consider selection of the step size. As $\alpha$ is increased from zero, the
point $x + \alpha d$ will initially remain feasible and the corresponding value of $f$ will
decrease. We find the length of the feasible segment of the line emanating from $x$
and then minimize $f$ over this segment. If the minimum occurs at the endpoint, a
new constraint will become active and will be added to the working set.

Next, consider the possibility that the projected negative gradient is zero. We
have in that case

$$\nabla f(x_k) + \lambda_k^T A_q = 0, \tag{12.16}$$

and the point $x_k$ satisfies the necessary conditions for a minimum on the working
surface. If the components of $\lambda_k$ corresponding to the active inequalities are all nonnegative,
then this fact together with (12.16) implies that the Karush-Kuhn-Tucker
conditions for the original problem are satisfied at $x_k$ and the process terminates. In
this case the $\lambda_k$ found by projecting the negative gradient is essentially the Lagrange
multiplier vector for the original problem (except that zero-valued multipliers must
be appended for the inactive constraints).

If, however, at least one of those components of $\lambda_k$ is negative, it is possible, by
relaxing the corresponding inequality, to move in a new direction to an improved
point. Suppose that $\lambda_{jk}$, the $j$th component of $\lambda_k$, is negative and the indexing is
arranged so that the corresponding constraint is the inequality $a_j^T x \leq b_j$. We determine
the new direction vector by relaxing the $j$th constraint and projecting the negative
gradient onto the subspace determined by the remaining $q - 1$ active constraints. Let
$\bar{A}_q$ denote the matrix $A_q$ with row $a_j$ deleted. We have for some $\bar{\lambda}_k$

$$ -g_k = A_q^T \lambda_k \tag{12.17} $$

$$ -g_k = \bar{d}_k + \bar{A}_q^T \bar{\lambda}_k, \tag{12.18} $$

where $\bar{d}_k$ is the projection of $-g_k$ using $\bar{A}_q$. It is immediately clear that $\bar{d}_k \neq 0$,
since otherwise (12.18) would be a special case of (12.17) with $\lambda_{jk} = 0$, which
is impossible, since the rows of $A_q$ are linearly independent. From our previous
work we know that $g_k^T \bar{d}_k < 0$. Multiplying the transpose of (12.17) by $\bar{d}_k$ and using
$\bar{A}_q \bar{d}_k = 0$ we obtain

$$ 0 > g_k^T \bar{d}_k = -\lambda_{jk}\, a_j^T \bar{d}_k. \tag{12.19} $$

Since $\lambda_{jk} < 0$ we conclude that $a_j^T \bar{d}_k < 0$. Thus the vector $\bar{d}_k$ is not only a direction
of descent, but it is a feasible direction, since $a_i^T \bar{d}_k = 0$, $i \in W(x_k)$, $i \neq j$, and
$a_j^T \bar{d}_k < 0$. Hence $j$ can be dropped from $W(x_k)$.
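The relaxation argument can be checked on a small case. The following is a minimal sketch, assuming two active inequalities in the plane with rows $a_1 = (1, 0)$ and $a_2 = (1, 1)$ and gradient $g = (-1, 1)$; this data and the helper names `solve` and `project` are illustrative choices, not taken from the text. Exact rational arithmetic is used so the signs can be read off directly.

```python
from fractions import Fraction

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(M, b):
    """Solve M z = b exactly by Gauss-Jordan elimination (M square, nonsingular)."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(M)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)  # first usable pivot
        aug[c], aug[p] = aug[p], aug[c]
        aug[c] = [v / aug[c][c] for v in aug[c]]
        for r in range(n):
            if r != c:
                aug[r] = [vr - aug[r][c] * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n] for row in aug]

def project(A, g):
    """Return (lam, d) with -g = d + A^T lam and A d = 0, as in (12.11)-(12.14)."""
    AAt = [[dot(ri, rj) for rj in A] for ri in A]
    lam = solve(AAt, [-dot(r, g) for r in A])           # (12.13)
    d = [-gi - sum(l * r[i] for l, r in zip(lam, A))    # d = -g - A^T lam
         for i, gi in enumerate(g)]
    return lam, d

# Hypothetical data: both constraints active, so the full projection vanishes.
A_q = [[1, 0], [1, 1]]
g = [-1, 1]
lam, d = project(A_q, g)            # lam = [2, -1], d = [0, 0]

# Relax the constraint with the most negative multiplier and project again.
j = min(range(len(lam)), key=lambda i: lam[i])
A_bar = [row for i, row in enumerate(A_q) if i != j]
lam_bar, d_bar = project(A_bar, g)  # d_bar = [0, -1]

descent = dot(g, d_bar)             # negative: direction of descent
feasible = dot(A_q[j], d_bar)       # negative: moves off constraint j inward
```

Here the second multiplier is negative, and after dropping that row the new direction satisfies both inequalities in (12.19) strictly.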

In summary, one step of the algorithm is as follows: Given a feasible point $x$

1. Find the subspace of active constraints $M$, and form $A_q$, $W(x)$.

2. Calculate $P = I - A_q^T (A_q A_q^T)^{-1} A_q$ and $d = -P \nabla f(x)^T$.

3. If $d \neq 0$, find $\alpha_1$ and $\alpha_2$ achieving, respectively,

$\max \{\alpha : x + \alpha d \text{ is feasible}\}$
$\min \{f(x + \alpha d) : 0 \leq \alpha \leq \alpha_1\}$.

Set $x$ to $x + \alpha_2 d$ and return to step 1.

4. If $d = 0$, find $\lambda = -(A_q A_q^T)^{-1} A_q \nabla f(x)^T$.

(a) If $\lambda_j \geq 0$, for all $j$ corresponding to active inequalities, stop; $x$ satisfies the
Karush-Kuhn-Tucker conditions.

(b) Otherwise, delete the row from $A_q$ corresponding to the inequality with the
most negative component of $\lambda$ (and drop the corresponding constraint from
$W(x)$) and return to step 2.
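The two one-dimensional problems in step 3 can be sketched as follows. This is only an illustration under stated assumptions: the inequalities are written as $a_i^T x \leq b_i$, the objective is a convex quadratic with Hessian $Q$ (so the line minimization has a closed form), and the function names and the small test problem are hypothetical.

```python
from fractions import Fraction

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def max_feasible_step(x, d, A, b):
    """Largest alpha with A (x + alpha d) <= b, given A x <= b (ratio test)."""
    alpha1, blocking = None, None
    for i, (a, bi) in enumerate(zip(A, b)):
        slope = dot(a, d)
        if slope > 0:  # only constraints that d moves toward can block
            step = Fraction(bi - dot(a, x)) / slope
            if alpha1 is None or step < alpha1:
                alpha1, blocking = step, i
    return alpha1, blocking  # (None, None): no constraint limits the step

def quadratic_step(g, Q, d, alpha1):
    """Minimize f(x + alpha d) over [0, alpha1] for a convex quadratic f."""
    curvature = dot(d, [dot(row, d) for row in Q])
    alpha = Fraction(-dot(g, d)) / curvature  # unconstrained minimizer along d
    if alpha1 is not None and alpha > alpha1:
        return alpha1                         # minimum occurs at the endpoint
    return alpha

# Hypothetical problem: f(x) = (x1 - 1)^2 + (x2 + 1)^2 with x1 >= 0, x2 >= 0,
# written as -x1 <= 0 and -x2 <= 0. No constraint is active at x = (2, 1).
x = [2, 1]
g = [2, 4]                 # gradient of f at x
Q = [[2, 0], [0, 2]]       # Hessian of f
d = [-2, -4]               # d = -g, since the working set is empty
A = [[-1, 0], [0, -1]]
b = [0, 0]

alpha1, blocking = max_feasible_step(x, d, A, b)   # 1/4, blocked by x2 >= 0
alpha2 = quadratic_step(g, Q, d, alpha1)           # 1/4 (the endpoint)
x_new = [xi + alpha2 * di for xi, di in zip(x, d)] # (3/2, 0)
```

In this instance the minimum along the segment occurs at its endpoint, so the constraint $x_2 \geq 0$ would be added to the working set before the next step.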

The projection matrix need not be recomputed in its entirety at each new point.
Since the set of active constraints in the working set changes by at most one constraint
at a time, it is possible to calculate one required projection matrix from the
previous one by an updating procedure. (See Exercise 11.) This is an important feature
of the gradient projection method and greatly reduces the computation required
at each step.

Example. Consider the problem


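$$
\begin{aligned}
\text{minimize} \quad & x_1^2 + x_2^2 + x_3^2 + x_4^2 - 2x_1 - 3x_4 \\
\text{subject to} \quad & 2x_1 + x_2 + x_3 + 4x_4 = 7 \\
& x_1 + x_2 + 2x_3 + x_4 = 6 \\
& x_i \geq 0, \quad i = 1, 2, 3, 4.
\end{aligned} \tag{12.20}
$$

Suppose that given the feasible point $x = (2, 2, 1, 0)$ we wish to find the direction of
the projected negative gradient. The active constraints are the two equalities and the
inequality $x_4 \geq 0$. Thus

$$ A_q = \begin{bmatrix} 2 & 1 & 1 & 4 \\ 1 & 1 & 2 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \tag{12.21} $$

and hence

$$ A_q A_q^T = \begin{bmatrix} 22 & 9 & 4 \\ 9 & 7 & 1 \\ 4 & 1 & 1 \end{bmatrix}. $$

After considerable calculation we then find

$$ (A_q A_q^T)^{-1} = \frac{1}{11} \begin{bmatrix} 6 & -5 & -19 \\ -5 & 6 & 14 \\ -19 & 14 & 73 \end{bmatrix} $$

and finally

$$ P = \frac{1}{11} \begin{bmatrix} 1 & -3 & 1 & 0 \\ -3 & 9 & -3 & 0 \\ 1 & -3 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}. \tag{12.22} $$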

The gradient at the point $(2, 2, 1, 0)$ is $g = (2, 4, 2, -3)$ and hence we find

$$ d = -Pg = \tfrac{1}{11}(8, -24, 8, 0), $$

or normalizing by 8/11,

$$ d = (1, -3, 1, 0). \tag{12.23} $$


It  can  be  easily  verified  that  movement  in  this  direction  does  not  violate  the 
constraints. 
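The "considerable calculation" in this example is easy to reproduce in exact arithmetic. The sketch below rebuilds $P$ from $A_q$ and evaluates $d = -Pg$ for the data of (12.20) and (12.21); the helper names `solve` and `projection_matrix` are hypothetical, and the approach (solving with $A_q A_q^T$ column by column) is one simple choice rather than the updating procedure mentioned above.

```python
from fractions import Fraction

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(M, b):
    """Solve M z = b exactly by Gauss-Jordan elimination (M square, nonsingular)."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(M)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)  # first usable pivot
        aug[c], aug[p] = aug[p], aug[c]
        aug[c] = [v / aug[c][c] for v in aug[c]]
        for r in range(n):
            if r != c:
                aug[r] = [vr - aug[r][c] * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n] for row in aug]

def projection_matrix(A):
    """P = I - A^T (A A^T)^{-1} A, built one column at a time, as in (12.15)."""
    n = len(A[0])
    AAt = [[dot(ri, rj) for rj in A] for ri in A]
    cols = []
    for i in range(n):
        e = [1 if k == i else 0 for k in range(n)]
        w = solve(AAt, [dot(r, e) for r in A])   # (A A^T)^{-1} A e_i
        cols.append([e[k] - sum(wj * r[k] for wj, r in zip(w, A)) for k in range(n)])
    return cols  # P is symmetric, so its columns are also its rows

A_q = [[2, 1, 1, 4],
       [1, 1, 2, 1],
       [0, 0, 0, 1]]
g = [2, 4, 2, -3]                       # gradient at x = (2, 2, 1, 0)

P = projection_matrix(A_q)
d = [-dot(row, g) for row in P]         # d = -P g = (8, -24, 8, 0) / 11
slope = dot(g, d)                       # g^T d = -|d|^2 < 0
```

Since $A_q d = 0$ and $x_2$ is the only component that decreases along $d$, the point $x + \alpha d$ stays feasible for $0 \leq \alpha \leq 2/3$.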


Nonlinear  Constraints 

In extending the gradient projection method to problems of the form

$$
\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{subject to} \quad & h(x) = 0, \ g(x) \leq 0,
\end{aligned} \tag{12.24}
$$

the basic idea is that at a feasible point one determines the active constraints and
projects the negative gradient onto the subspace tangent to the surface determined
by these constraints. This vector, if it is nonzero, determines the direction for the
next step. The vector itself, however, is not in general a feasible direction, since the
surface may be curved as illustrated in Fig. 12.4. It is therefore not always possible
to move along this projected negative gradient to obtain the next point.

What  is  typically  done  in  the  face  of  this  difficulty  is  essentially  to  search  along 
a  curve  on  the  constraint  surface,  the  direction  of  the  curve  being  defined  by  the 
projected  negative  gradient.  A  new  point  is  found  in  the  following  way:  First,  a 
move  is  made  along  the  projected  negative  gradient  to  a  point  y.  Then  a  move  is 
made  in  the  direction  perpendicular  to  the  tangent  plane  at  the  original  point  to  a 
nearby  feasible  point  on  the  working  surface,  as  illustrated  in  Fig.  12.4.  Once  this 
point  is  found  the  value  of  the  objective  is  determined.  This  is  repeated  with  various 
y’s  until  a  feasible  point  is  found  that  satisfies  one  of  the  standard  descent  criteria 
for  improvement  relative  to  the  original  point. 


Fig.  12.4  Gradient  projection  method 


This  procedure  of  tentatively  moving  away  from  the  feasible  region  and  then 
coming  back  introduces  a  number  of  additional  difficulties  that  require  a  series  of 
interpolations  and  nonlinear  equation  solutions  for  their  resolution.  A  satisfactory 
general routine implementing the gradient projection philosophy is therefore of necessity
quite complex. It is not our purpose here to elaborate on these details but
simply  to  point  out  the  general  nature  of  the  difficulties  and  the  basic  devices  for 
surmounting  them. 

One difficulty is illustrated in Fig. 12.5. If, after moving along the projected negative
gradient to a point $y$, one attempts to return to a point that satisfies the old
active constraints, some inequalities that were originally satisfied may then be violated.
One must in this circumstance use an interpolation scheme to find a new point
$y$ along the negative gradient so that when returning to the active constraints no originally
nonactive constraint is violated. Finding an appropriate $y$ is to some extent a
trial and error process. Finally, the job of returning to the active constraints is itself
a nonlinear problem which must be solved with an iterative technique. Such a technique
is described below, but within a finite number of iterations, it cannot exactly
reach the surface. Thus typically an error tolerance $\varepsilon$ is introduced, and throughout
the procedure the constraints are satisfied only to within $\varepsilon$.

Computation of the projections is also more difficult in the nonlinear case. Lumping,
for notational convenience, the active inequalities together with the equalities
into $h(x_k)$, the projection matrix at $x_k$ is

$$ P_k = I - \nabla h(x_k)^T [\nabla h(x_k) \nabla h(x_k)^T]^{-1} \nabla h(x_k). \tag{12.25} $$

At the point $x_k$ this matrix can be updated to account for one more or one less
constraint, just as in the linear case. When moving from $x_k$ to $x_{k+1}$, however, $\nabla h$
will change and the new projection matrix cannot be found from the old, and hence
this matrix must be recomputed at each step.


Fig.  12.5  Interpolation  to  obtain  feasible  point 


The most important new feature of the method is the problem of returning to the
feasible region from points outside this region. The type of iterative technique employed
is a common one in nonlinear programming, including interior-point methods
of linear programming, and we describe it here. The idea is, from any point near
$x_k$, to move back to the constraint surface in a direction orthogonal to the tangent
plane at $x_k$. Thus from a point $y$ we seek a point of the form $y + \nabla h(x_k)^T \alpha = y^*$ such
that $h(y^*) = 0$. As shown in Fig. 12.6 such a solution may not always exist, but it
does for $y$ sufficiently close to $x_k$.

To find a suitable first approximation to $\alpha$, and hence to $y^*$, we linearize the
equation at $x_k$ obtaining

$$ h(y + \nabla h(x_k)^T \alpha) \simeq h(y) + \nabla h(x_k) \nabla h(x_k)^T \alpha, \tag{12.26} $$

the approximation being accurate for $|\alpha|$ and $|y - x_k|$ small. This motivates the first
approximation

$$ \alpha_1 = -[\nabla h(x_k) \nabla h(x_k)^T]^{-1} h(y) \tag{12.27} $$

$$ y_1 = y - \nabla h(x_k)^T [\nabla h(x_k) \nabla h(x_k)^T]^{-1} h(y). \tag{12.28} $$

Substituting $y_1$ for $y$ and successively repeating the process yields the sequence $\{y_j\}$
generated by

$$ y_{j+1} = y_j - \nabla h(x_k)^T [\nabla h(x_k) \nabla h(x_k)^T]^{-1} h(y_j), \tag{12.29} $$

which, started close enough to $x_k$ and the constraint surface, will converge to a
solution $y^*$. We note that this process requires the same matrices as the projection
operation.
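The iteration (12.29) is easiest to see with a single constraint, where the bracketed matrix reduces to a scalar. The sketch below assumes the unit circle $h(x) = x_1^2 + x_2^2 - 1$ as the constraint surface, with $x_k = (1, 0)$ and a trial point $y = (1, 0.3)$ reached along the tangent; the constraint, the tolerance, and the function names are illustrative assumptions only.

```python
def h(x):
    """Single constraint h(x) = x1^2 + x2^2 - 1 (unit circle); illustrative choice."""
    return x[0] ** 2 + x[1] ** 2 - 1.0

def grad_h(x):
    return [2.0 * x[0], 2.0 * x[1]]

def return_to_surface(y, xk, tol=1e-12, max_iter=50):
    """Iterate (12.29) for one constraint:
    y <- y - grad_h(xk) * h(y) / |grad_h(xk)|^2, with grad_h frozen at xk."""
    n = grad_h(xk)                  # normal direction at x_k, held fixed
    nn = sum(c * c for c in n)      # scalar form of grad_h(xk) grad_h(xk)^T
    for _ in range(max_iter):
        r = h(y)
        if abs(r) <= tol:           # constraint satisfied to within tol
            return y
        y = [yi - ni * r / nn for yi, ni in zip(y, n)]
    raise RuntimeError("no convergence; y may be too far from the surface")

xk = [1.0, 0.0]                     # current feasible point
y = [1.0, 0.3]                      # trial point off the surface
y_star = return_to_surface(y, xk)   # approximately (sqrt(0.91), 0.3)
```

Because the gradient is evaluated only at $x_k$, every correction is parallel to $\nabla h(x_k)^T$; here only the first coordinate changes. The price is linear rather than quadratic convergence, and the tolerance plays the role of the error tolerance $\varepsilon$ discussed earlier.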

The  gradient  projection  method  has  been  successfully  implemented  and  has  been 
found to be effective in solving general nonlinear programming problems. Successful
implementation resolving the several difficulties introduced by the requirement
of  staying  in  the  feasible  region  requires,  as  one  would  expect,  some  degree  of  skill. 
The  true  value  of  the  method,  however,  can  be  determined  only  through  an  analysis 
of  its  rate  of  convergence. 


Fig.  12.6  Case  in  which  it  is  impossible  to  return  to  surface 


12.5  Convergence  Rate  of  the  Gradient  Projection  Method 

An analysis that directly attacked the nonlinear version of the gradient projection
method, with all of its iterative and interpolative devices, would quickly become
monstrous. To obtain the asymptotic rate of convergence, however, it is not necessary
to analyze this complex algorithm directly; instead it is sufficient to analyze an
alternate simplified algorithm that asymptotically duplicates the gradient projection
method near the solution. Through the introduction of this idealized algorithm we
show that the rate of convergence of the gradient projection method is governed by
the eigenvalue structure of the Hessian of the Lagrangian restricted to the constraint
tangent subspace.

Geodesic Descent

For simplicity we consider first the problem having only equality constraints

$$
\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{subject to} \quad & h(x) = 0.
\end{aligned} \tag{12.30}
$$

The constraints define a continuous surface $\Omega$ in $E^n$.

In  considering  our  own  difficulties  with  this  problem,  owing  to  the  fact  that  the 
surface  is  nonlinear  thereby  making  directions  of  descent  difficult  to  define,  it  is 
well  to  also  consider  the  problem  as  it  would  be  viewed  by  a  small  bug  confined  to 
the  constraint  surface  who  imagines  it  to  be  his  total  universe.  To  him  the  problem 
seems  to  be  a  simple  one.  It  is  unconstrained,  with  respect  to  his  universe,  and  is 
only $(n - m)$-dimensional. He would characterize a solution point as a point where
the gradient of $f$ (as measured on the surface) vanishes and where the appropriate
$(n - m)$-dimensional Hessian of $f$ is positive semidefinite. If asked to develop a
computational procedure for this problem, he would undoubtedly suggest, since he
views the problem as unconstrained, the method of steepest descent. He would compute
the gradient, as measured on his surface, and would move along what would
appear  to  him  to  be  straight  lines. 

Exactly  what  the  bug  would  compute  as  the  gradient  and  exactly  what  he  would 
consider  as  straight  lines  would  depend  basically  on  how  distance  between  two 
points on his surface were measured. If, as is most natural, we assume that he inherits
his notion of distance from the one which we are using in $E^n$, then the path $x(t)$
between two points $x_1$ and $x_2$ on his surface that minimizes the arc length
$\int_{x_1}^{x_2} |\dot{x}(t)|\,dt$ would be considered a straight line by him. Such a curve,
having minimum arc length between two given points, is called a geodesic.

Returning  to  our  own  view  of  the  problem,  we  note,  as  we  have  previously,  that  if 
we  project  the  negative  gradient  onto  the  tangent  plane  of  the  constraint  surface  at  a 
point $x_k$, we cannot move along this projection itself and remain feasible. We might,
however,  consider  moving  along  a  curve  which  had  the  same  initial  heading  as  the 
projected  negative  gradient  but  which  remained  on  the  surface.  Exactly  which  such 
curve  to  move  along  is  somewhat  arbitrary,  but  a  natural  choice,  inspired  perhaps 
by  the  considerations  of  the  bug,  is  a  geodesic.  Specifically,  at  a  given  point  on  the 
surface,  we  would  determine  the  geodesic  curve  passing  through  that  point  that  had 
an  initial  heading  identical  to  that  of  the  projected  negative  gradient.  We  would  then 
move along this geodesic to a new point on the surface having a lesser value of $f$.

The  idealized  procedure  then,  which  the  bug  would  use  without  a  second  thought, 
and  which  we  would  use  if  it  were  computationally  feasible  (which  it  definitely  is 
not), would at a given feasible point $x_k$ (see Fig. 12.7):

1. Calculate the projection $p$ of $-\nabla f(x_k)^T$ onto the tangent plane at $x_k$.

2. Find the geodesic, $x(t)$, $t \geq 0$, of the constraint surface having $x(0) = x_k$,
$\dot{x}(0) = p$.

3. Minimize $f(x(t))$ with respect to $t \geq 0$, obtaining $t_k$ and $x_{k+1} = x(t_k)$.

At this point we emphasize that this technique (which we refer to as geodesic descent)
is proposed essentially for theoretical purposes only. It does, however, capture
the main philosophy of the gradient projection method. Furthermore, as the step size
of the methods goes to zero, as it does near the solution point, the distance between
the point that would be determined by the gradient projection method and the point
found by the idealized method goes to zero even faster. Thus the asymptotic rates of
convergence for the two methods will be equal, and it is, therefore, appropriate to
concentrate on the idealized method only.

Our  bug  confined  to  the  surface  would  have  no  hesitation  in  estimating  the  rate 
of  convergence  of  this  method.  He  would  simply  express  it  in  terms  of  the  smallest 
and largest eigenvalues of the Hessian of $f$ as measured on his surface. It should not
be surprising, then, that we show that the asymptotic convergence ratio is

$$ \left( \frac{A - a}{A + a} \right)^2, \tag{12.31} $$


Fig.  12.7  Geodesic  descent 

where $a$ and $A$ are, respectively, the smallest and largest eigenvalues of $L$, the Hessian
of the Lagrangian, restricted to the tangent subspace $M$. This result parallels
the  convergence  rate  of  the  method  of  steepest  descent,  but  with  the  eigenvalues 
determined  from  the  same  restricted  Hessian  matrix  that  is  important  in  the  general 
theory  of  necessary  and  sufficient  conditions  for  constrained  problems.  This  rate, 
which  almost  invariably  arises  when  studying  algorithms  designed  for  constrained 
problems,  will  be  referred  to  as  the  canonical  rate. 
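To get a feel for how the canonical rate depends on the spread of the eigenvalues, the ratio (12.31) can be tabulated directly. The sketch below is illustrative only: the eigenvalues are made-up inputs, the helper names are hypothetical, and the iteration count treats the asymptotic ratio as if it held at every step, which is a rough estimate rather than a guarantee.

```python
import math

def canonical_ratio(a, A):
    """Asymptotic bound [(A - a)/(A + a)]^2 from (12.31); requires 0 < a <= A."""
    if not 0 < a <= A:
        raise ValueError("need 0 < a <= A")
    return ((A - a) / (A + a)) ** 2

def iterations_to_reduce(ratio, factor):
    """Smallest k with ratio**k <= factor, for 0 < ratio < 1 and 0 < factor < 1."""
    return math.ceil(math.log(factor) / math.log(ratio))

well = canonical_ratio(1.0, 10.0)     # (9/11)^2, about 0.669
poor = canonical_ratio(1.0, 100.0)    # (99/101)^2, much closer to 1

k_well = iterations_to_reduce(well, 1e-6)   # 35 steps at the asymptotic rate
k_poor = iterations_to_reduce(poor, 1e-6)   # roughly ten times as many
```

The ratio depends only on $A/a$, the condition number of the restricted Hessian, which is why comparisons among algorithms can be made without evaluating it for a particular problem.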

We  emphasize  again  that,  since  this  convergence  ratio  governs  the  convergence 
of  a  large  family  of  algorithms,  it  is  the  formula  itself  rather  than  its  numerical 
value  that  is  important.  For  any  given  problem  we  do  not  suggest  that  this  ratio  be 
evaluated,  since  this  would  be  extremely  difficult.  Instead,  the  potency  of  the  result 
derives  from  the  fact  that  fairly  comprehensive  comparisons  among  algorithms  can 
be  made,  on  the  basis  of  this  formula,  that  apply  to  general  classes  of  problems 
rather than simply to particular problems.

The remainder of this section is devoted to the analysis that is required to establish
the convergence rate. Since this analysis is somewhat involved and not crucial
for  an  understanding  of  remaining  material,  some  readers  may  wish  to  simply  read 
the  theorem  statement  and  proceed  to  the  next  section. 


Geodesics 

Given the surface $\Omega = \{x : h(x) = 0\} \subset E^n$, a smooth curve $x(t) \in \Omega$,
$0 \leq t \leq T$, starting at $x(0)$ and terminating at $x(T)$ that minimizes the total arc length

$$ \int_0^T |\dot{x}(t)|\,dt $$

with respect to all other such curves on $\Omega$ is said to be a geodesic connecting $x(0)$
and $x(T)$.

It is common to parameterize a geodesic $x(t)$, $0 \leq t \leq T$, so that $|\dot{x}(t)| = 1$. The
parameter  t  is  then  itself  the  arc  length.  If  the  parameter  t  is  also  regarded  as  time, 
then  this  parameterization  corresponds  to  moving  along  the  geodesic  curve  with  unit 
velocity.  Parameterized  in  this  way,  the  geodesic  is  said  to  be  normalized.  On  any 
linear  subspace  of  En  geodesics  are  straight  lines.  On  a  three-dimensional  sphere, 
the  geodesics  are  arcs  of  great  circles. 

It can be shown, using the calculus of variations, that any normalized geodesic
on $\Omega$ satisfies the condition

$$ \ddot{x}(t) = \nabla h(x(t))^T \omega(t) \tag{12.32} $$

for some function $\omega$ taking values in $E^m$. Geometrically, this condition says that
if one moves along the geodesic curve with unit velocity, the acceleration at every
point will be orthogonal to the surface. Indeed, this property can be regarded as
the fundamental defining characteristic of a geodesic. To stay on the surface $\Omega$, the
geodesic must also satisfy the equation

$$ \nabla h(x(t))\,\dot{x}(t) = 0, \tag{12.33} $$

since the velocity vector at every point is tangent to $\Omega$. At a regular point $x_0$ these
two differential equations, together with the initial conditions $x(0) = x_0$, $\dot{x}(0)$ specified,
and $|\dot{x}(0)| = 1$, uniquely specify a curve $x(t)$, $t \geq 0$, that can be continued as
long as points on the curve are regular. Furthermore, $|\dot{x}(t)| = 1$ for $t \geq 0$. Hence
geodesic curves emanate in every direction from a regular point. Thus, for example,
at any point on a sphere there is a unique great circle passing through the point in a
given direction.

Lagrangian and Geodesics

Corresponding to any regular point $x \in \Omega$ we may define a corresponding Lagrange
multiplier $\lambda(x)$ by calculating the projection of the gradient of $f$ onto the tangent
subspace at $x$, denoted $M(x)$. The matrix that, when operating on a vector, projects
it onto $M(x)$ is

$$ P(x) = I - \nabla h(x)^T [\nabla h(x) \nabla h(x)^T]^{-1} \nabla h(x), $$

and it follows immediately that the projection of $\nabla f(x)^T$ onto $M(x)$ has the form

$$ y(x) = [\nabla f(x) + \lambda(x)^T \nabla h(x)]^T, \tag{12.34} $$

where $\lambda(x)$ is given explicitly as

$$ \lambda(x)^T = -\nabla f(x) \nabla h(x)^T [\nabla h(x) \nabla h(x)^T]^{-1}. \tag{12.35} $$

Thus, in terms of the Lagrangian function $l(x, \lambda) = f(x) + \lambda^T h(x)$, the projected
gradient is

$$ y(x) = l_x(x, \lambda(x))^T. \tag{12.36} $$

If a local solution to the original problem occurs at a regular point $x^* \in \Omega$, then as
we know

$$ l_x(x^*, \lambda(x^*)) = 0, \tag{12.37} $$

which states that the projected gradient must vanish at $x^*$. Defining
$L(x) = l_{xx}(x, \lambda(x)) = F(x) + \lambda(x)^T H(x)$ we also know that at $x^*$ we have the second-order
necessary condition that $L(x^*)$ is positive semidefinite on $M(x^*)$; that is, $z^T L(x^*) z \geq 0$ for all
$z \in M(x^*)$. Equivalently, letting

$$ \bar{L}(x) = P(x) L(x) P(x), \tag{12.38} $$

it follows that $\bar{L}(x^*)$ is positive semidefinite.

We  then  have  the  following  fundamental  and  simple  result,  valid  along  a  geodesic. 

Proposition 1. Let $x(t)$, $0 \leq t \leq T$, be a geodesic on $\Omega$. Then

$$ \frac{d}{dt} f(x(t)) = l_x(x, \lambda(x))\,\dot{x}(t) \tag{12.39} $$

$$ \frac{d^2}{dt^2} f(x(t)) = \dot{x}(t)^T L(x(t))\,\dot{x}(t). \tag{12.40} $$

Proof. We have

$$ \frac{d}{dt} f(x(t)) = \nabla f(x(t))\,\dot{x}(t) = l_x(x, \lambda(x))\,\dot{x}(t), $$

the second equality following from the fact that $\dot{x}(t) \in M(x)$. Next,

$$ \frac{d^2}{dt^2} f(x(t)) = \dot{x}(t)^T F(x(t))\,\dot{x}(t) + \nabla f(x(t))\,\ddot{x}(t). \tag{12.41} $$

But differentiating the relation $\lambda^T h(x(t)) = 0$ twice, for fixed $\lambda$, yields

$$ \dot{x}(t)^T \lambda^T H(x(t))\,\dot{x}(t) + \lambda^T \nabla h(x(t))\,\ddot{x}(t) = 0. $$

Adding this to (12.41), we have

$$ \frac{d^2}{dt^2} f(x(t)) = \dot{x}(t)^T (F + \lambda^T H)\,\dot{x}(t) + (\nabla f(x) + \lambda^T \nabla h(x))\,\ddot{x}(t), $$

which is true for any fixed $\lambda$. Setting $\lambda = \lambda(x)$ determined as above, $(\nabla f + \lambda^T \nabla h)^T$
is in $M(x)$ and hence orthogonal to $\ddot{x}(t)$, since $x(t)$ is a normalized geodesic. This
gives (12.40). ∎

It  should  be  noted  that  we  proved  a  simplified  version  of  this  result  in  Chap.  11. 
There  the  result  was  given  only  for  the  optimal  point  x*,  although  it  was  valid  for 
any  curve.  Here  we  have  shown  that  essentially  the  same  result  is  valid  at  any  point 
provided  that  we  move  along  a  geodesic. 


Rate  of  Convergence 

We  now  prove  the  main  theorem  regarding  the  rate  of  convergence.  We  assume 
that  all  functions  are  three  times  continuously  differentiable  and  that  every  point 
in  a  region  near  the  solution  x*  is  regular.  This  theorem  only  establishes  the  rate 
of  convergence  and  not  convergence  itself  so  for  that  reason  the  stated  hypotheses 
assume that the method of geodesic descent generates a sequence $\{x_k\}$ converging
to  x*. 

Theorem. Let $x^*$ be a local solution to the problem (12.30) and suppose that
$A$ and $a > 0$ are, respectively, the largest and smallest eigenvalues of $L(x^*)$ restricted to
the tangent subspace $M(x^*)$. If $\{x_k\}$ is a sequence generated by the method of geodesic
descent that converges to $x^*$, then the sequence of objective values $\{f(x_k)\}$ converges to
$f(x^*)$ linearly with a ratio no greater than $[(A - a)/(A + a)]^2$.

Proof. Without loss of generality we may assume $f(x^*) = 0$. Given a point $x_k$ it
will be convenient to define its distance from the solution point $x^*$ as the arc length
of the geodesic connecting $x^*$ and $x_k$. Thus if $x(t)$ is a parameterized version of the
geodesic with $x(0) = x^*$, $|\dot{x}(t)| = 1$, $x(T) = x_k$, then $T$ is the distance of $x_k$ from
$x^*$. Associated with such a geodesic we also have the family $y(t)$, $0 \leq t \leq T$, of
corresponding projected gradients $y(t) = l_x(x, \lambda(x))^T$, and Hessians $L(t) = L(x(t))$.
We write $y_k = y(x_k)$, $L_k = L(x_k)$.

We now derive an estimate for $f(x_k)$. Using the geodesic discussed above we can
write (setting $\dot{x}_k = \dot{x}(T)$)

$$ f(x^*) - f(x_k) = -f(x_k) = -y_k^T \dot{x}_k T + \tfrac{1}{2} T^2 \dot{x}_k^T L_k \dot{x}_k + o(T^2), \tag{12.42} $$

which follows from Proposition 1. We also have

$$ y_k = -y(x^*) + y(x_k) = \dot{y}_k T + o(T). \tag{12.43} $$

But differentiating (12.34) we obtain

$$ \dot{y}_k = L_k \dot{x}_k + \nabla h(x_k)^T \dot{\lambda}_k, \tag{12.44} $$

and hence if $P_k$ is the projection matrix onto $M(x_k) = M_k$, we have

$$ P_k \dot{y}_k = P_k L_k \dot{x}_k. \tag{12.45} $$

Multiplying (12.43) by $P_k$ and accounting for $P_k y_k = y_k$ we have

$$ P_k \dot{y}_k T = y_k + o(T). \tag{12.46} $$

Substituting (12.45) into this we obtain

$$ P_k L_k \dot{x}_k T = y_k + o(T). $$

Since $P_k \dot{x}_k = \dot{x}_k$ we have, defining $\bar{L}_k = P_k L_k P_k$,

$$ \bar{L}_k \dot{x}_k T = y_k + o(T). \tag{12.47} $$

The matrix $\bar{L}_k$ is related to $L_{M_k}$, the restriction of $L_k$ to $M_k$, the only difference
being that while $L_{M_k}$ is defined only on $M_k$, the matrix $\bar{L}_k$ is defined on all of $E^n$
but in such a way that it agrees with $L_{M_k}$ on $M_k$ and is zero on $M_k^{\perp}$. The matrix $\bar{L}_k$
is not invertible, but for $y_k \in M_k$ there is a unique solution $z \in M_k$ to the equation
$\bar{L}_k z = y_k$ which we denote¹ $\bar{L}_k^{-1} y_k$. With this notation we obtain from (12.47)

$$ \dot{x}_k T = \bar{L}_k^{-1} y_k + o(T). \tag{12.48} $$

¹ Actually a more standard procedure is to define the pseudoinverse $\bar{L}_k^{\dagger}$, and then $z = \bar{L}_k^{\dagger} y_k$.

Substituting this last result into (12.42) and accounting for $|y_k| = O(T)$ (see (12.43))
we have

$$ f(x_k) = \tfrac{1}{2} y_k^T \bar{L}_k^{-1} y_k + o(T^2), \tag{12.49} $$

which expresses the objective value at $x_k$ in terms of the projected gradient.

Since $|\dot{x}_k| = 1$ and since $\bar{L}_k \to \bar{L}^*$ as $x_k \to x^*$, we see from (12.47) that

$$ o(T) + aT \leq |y_k| \leq AT + o(T), \tag{12.50} $$

which means that not only do we have $|y_k| = O(T)$, which was known before, but
also $|y_k| \neq o(T)$. We may therefore write our estimate (12.49) in the alternate form

$$ f(x_k) = \tfrac{1}{2} y_k^T \bar{L}_k^{-1} y_k \left( 1 + \frac{o(T^2)}{y_k^T \bar{L}_k^{-1} y_k} \right), \tag{12.51} $$

and since $o(T^2) \neq y_k^T \bar{L}_k^{-1} y_k = O(T^2)$, we have

$$ f(x_k) = \tfrac{1}{2} y_k^T \bar{L}_k^{-1} y_k \,(1 + O(T)), \tag{12.52} $$

which is the desired estimate.

Next, we estimate $f(x_{k+1})$ in terms of $f(x_k)$. Given $x_k$ now let $x(t)$, $t \geq 0$, be
the normalized geodesic emanating from $x_k = x(0)$ in the direction of the negative
projected gradient, that is,

$$ \dot{x}(0) = \dot{x}_k = -y_k / |y_k|. $$

Then

$$ f(x(t)) = f(x_k) + t\, y_k^T \dot{x}_k + \frac{t^2}{2} \dot{x}_k^T L_k \dot{x}_k + o(t^2). \tag{12.53} $$

This is minimized at

$$ t_k = -\frac{y_k^T \dot{x}_k}{\dot{x}_k^T L_k \dot{x}_k} + o(t_k). \tag{12.54} $$

In view of (12.50) this implies that $t_k = O(T)$, $t_k \neq o(T)$. Thus $t_k$ goes to zero at
essentially the same rate as $T$. Thus we have

$$ f(x_{k+1}) = f(x_k) - \frac{1}{2} \frac{(y_k^T \dot{x}_k)^2}{\dot{x}_k^T L_k \dot{x}_k} + o(T^2). \tag{12.55} $$

Using the same argument as before we can express this as

$$ f(x_k) - f(x_{k+1}) = \frac{1}{2} \frac{(y_k^T y_k)^2}{y_k^T L_k y_k} (1 + O(T)), \tag{12.56} $$

which is the other required estimate.

Finally, dividing (12.56) by (12.52) we find

$$ \frac{f(x_k) - f(x_{k+1})}{f(x_k)} = \frac{(y_k^T y_k)^2 (1 + O(T))}{(y_k^T L_k y_k)(y_k^T \bar{L}_k^{-1} y_k)}, \tag{12.57} $$

and thus

$$ f(x_{k+1}) = \left[ 1 - \frac{(y_k^T y_k)^2 (1 + O(T))}{(y_k^T L_k y_k)(y_k^T \bar{L}_k^{-1} y_k)} \right] f(x_k). \tag{12.58} $$

Using the fact that $\bar{L}_k \to \bar{L}^*$ and applying the Kantorovich inequality leads to

$$ f(x_{k+1}) \leq \left[ \left( \frac{A - a}{A + a} \right)^2 + O(T) \right] f(x_k). \ ∎ \tag{12.59} $$

Problems with Inequalities

The idealized version of gradient projection could easily be extended to problems
having nonlinear inequalities as well as equalities by following the pattern of
Sect. 12.4. Such an extension, however, has no real value, since the idealized scheme
cannot be implemented. The idealized procedure was devised only as a technique
for analyzing the asymptotic rate of convergence of the analytically more complex,
but more practical, gradient projection method.

The analysis of the idealized version of gradient projection given above, nevertheless,
does apply to problems having inequality as well as equality constraints. If
a computationally feasible procedure is employed that avoids jamming and does not
bounce on and off constraint boundaries an infinite number of times, then near the
solution the active constraints will remain fixed. This means that near the solution
the method acts just as if it were solving a problem having the active constraints as
equality constraints. Thus the asymptotic rate of convergence of the gradient projection
method applied to a problem with inequalities is also given by (12.59) but with
$L(x^*)$ and $M(x^*)$ (and hence $a$ and $A$) determined by the active constraints at the solution
point $x^*$. In every case, therefore, the rate of convergence is determined by the
eigenvalues of the same restricted Hessian that arises in the necessary conditions.


12.6  The  Reduced  Gradient  Method 

From  a  computational  viewpoint,  the  reduced  gradient  method,  discussed  in  this 
section  and  the  next,  is  closely  related  to  the  simplex  method  of  linear  programming 
in  that  the  problem  variables  are  partitioned  into  basic  and  nonbasic  groups.  From 
a  theoretical  viewpoint,  the  method  can  be  shown  to  behave  very  much  like  the 
gradient  projection  method. 


Linear  Constraints 


Consider the problem

$$
\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{subject to} \quad & Ax = b, \ x \geq 0,
\end{aligned} \tag{12.60}
$$

where $x \in E^n$, $b \in E^m$, $A$ is $m \times n$, and $f$ is a function in $C^2$. The constraints are
expressed in the format of the standard form of linear programming. For simplicity
of notation it is assumed that each variable is required to be nonnegative; if
some variables were free, the procedure (but not the notation) would be somewhat
simplified.

We invoke the nondegeneracy assumptions that every collection of $m$ columns
from $A$ is linearly independent and every basic solution to the constraints has $m$
strictly positive variables. With these assumptions any feasible solution will have at
most $n - m$ variables taking the value zero. Given a vector $x$ satisfying the constraints,
we partition the variables into two groups: $x = (y, z)$ where $y$ has dimension $m$ and
$z$ has dimension $n - m$. This partition is formed in such a way that all variables in
$y$ are strictly positive (for simplicity of notation we indicate the basic variables as
being the first $m$ components of $x$ but, of course, in general this will not be so). With
respect to the partition, the original problem can be expressed as

$$ \text{minimize} \quad f(y, z) \tag{12.61a} $$

$$ \text{subject to} \quad By + Cz = b \tag{12.61b} $$

$$ y \geq 0, \ z \geq 0, \tag{12.61c} $$

where, of course, $A = [B, C]$. We can regard $z$ as consisting of the independent
variables and $y$ the dependent variables, since if $z$ is specified, (12.61b) can
be uniquely solved for $y$. Furthermore, a small change $\Delta z$ from the original value
that leaves $z + \Delta z$ nonnegative will, upon solution of (12.61b), yield another feasible
solution, since $y$ was originally taken to be strictly positive and thus $y + \Delta y$ will
also be positive for small $\Delta y$. We may therefore move from one feasible solution to
another by selecting a $\Delta z$ and moving $z$ on the line $z + \alpha \Delta z$, $\alpha \geq 0$. Accordingly,
$y$ will move along a corresponding line $y + \alpha \Delta y$. If in moving this way some variable
becomes zero, a new inequality constraint becomes active. If some independent
variable becomes zero, a new direction $\Delta z$ must be chosen. If a dependent (basic)
variable becomes zero, the partition must be modified. The zero-valued basic variable
is declared independent and one of the strictly positive independent variables
is made dependent. Operationally, this interchange will be associated with a pivot
operation.

The  idea  of  the  reduced  gradient  method  is  to  consider,  at  each  stage,  the  problem 
only  in  terms  of  the  independent  variables.  Since  the  vector  of  dependent  variables 
y  is  determined  through  the  constraints  (12.61b)  from  the  vector  of  independent 
variables  z,  the  objective  function  can  be  considered  to  be  a  function  of  z  only. 
Hence  a  simple  modification  of  steepest  descent,  accounting  for  the  constraints,  can 
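be executed. The gradient with respect to the independent variables $\mathbf{z}$ (the reduced gradient) is found by evaluating the gradient of $f(\mathbf{B}^{-1}\mathbf{b} - \mathbf{B}^{-1}\mathbf{C}\mathbf{z}, \mathbf{z})$. It is equal to

$$\mathbf{r}^T = \nabla_{\mathbf{z}} f(\mathbf{y}, \mathbf{z}) - \nabla_{\mathbf{y}} f(\mathbf{y}, \mathbf{z})\mathbf{B}^{-1}\mathbf{C}. \tag{12.62}$$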

It  is  easy  to  see  that  a  point  (y,  z)  satisfies  the  first-order  necessary  conditions  for 
optimality  if  and  only  if 


$$\begin{aligned}
r_i &= 0 \quad \text{for all } z_i > 0 \\
r_i &\geq 0 \quad \text{for all } z_i = 0.
\end{aligned}$$

In the active set form of the reduced gradient method the vector $\mathbf{z}$ is moved in the direction of the negative reduced gradient on the working surface. Thus at each step, a direction of the form

$$\Delta z_i = \begin{cases} -r_i, & i \notin W(\mathbf{z}) \\ 0, & i \in W(\mathbf{z}) \end{cases}$$

is determined and a descent is made in this direction. The working set is augmented whenever a new variable reaches zero; if it is a basic variable, a new partition is also formed. If a point is found where $r_i = 0$ for all $i \notin W(\mathbf{z})$ (representing a vanishing reduced gradient on the working surface) but $r_j < 0$ for some $j \in W(\mathbf{z})$, then $j$ is deleted from $W(\mathbf{z})$ as in the standard active set strategy.

It is possible to avoid the pure active set strategy by moving away from an active constraint whenever that would lead to an improvement, rather than waiting until an exact minimum on the working surface is found. Indeed, this type of procedure is
often  used  in  practice.  One  version  progresses  by  moving  the  vector  z  in  the  direc¬ 
tion  of  the  overall  negative  reduced  gradient,  except  that  zero-valued  components  of 
z  that  would  thereby  become  negative  are  held  at  zero.  One  step  of  the  procedure  is 
as  follows: 


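1. Let

$$\Delta z_i = \begin{cases} -r_i & \text{if } r_i < 0 \text{ or } z_i > 0 \\ 0 & \text{otherwise.} \end{cases}$$

2. If $\Delta\mathbf{z}$ is zero, stop; the current point is a solution. Otherwise, find $\Delta\mathbf{y} = -\mathbf{B}^{-1}\mathbf{C}\Delta\mathbf{z}$.

3. Find $\alpha_1$, $\alpha_2$, $\alpha_3$ achieving, respectively,

$$\begin{aligned}
&\max\{\alpha : \mathbf{y} + \alpha\Delta\mathbf{y} \geq \mathbf{0}\} \\
&\max\{\alpha : \mathbf{z} + \alpha\Delta\mathbf{z} \geq \mathbf{0}\} \\
&\min\{f(\mathbf{x} + \alpha\Delta\mathbf{x}) : 0 \leq \alpha \leq \alpha_1,\ 0 \leq \alpha \leq \alpha_2\}.
\end{aligned}$$

Let $\mathbf{x} = \mathbf{x} + \alpha_3\Delta\mathbf{x}$.

4. If $\alpha_3 < \alpha_1$, return to (1). Otherwise, declare the vanishing variable in the dependent set independent and declare a strictly positive variable in the independent set dependent. Update $\mathbf{B}$ and $\mathbf{C}$.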
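
The direction rule of Step 1, the dependent move of Step 2, and the two ratio tests of Step 3 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the small data set are hypothetical, the matrix B^{-1}C is assumed to be available as a list of rows, and the line search for the third step length and the pivot of Step 4 are left out.

```python
import math

def direction(r, z):
    """Step 1: dz_i = -r_i if r_i < 0 or z_i > 0, and 0 otherwise."""
    return [-ri if (ri < 0 or zi > 0) else 0.0 for ri, zi in zip(r, z)]

def dependent_move(BinvC, dz):
    """Step 2: dy = -B^{-1} C dz, with B^{-1}C given as a list of rows."""
    return [-sum(a * d for a, d in zip(row, dz)) for row in BinvC]

def max_step(v, dv):
    """Largest alpha >= 0 keeping v + alpha*dv >= 0 (inf if no component decreases)."""
    ratios = [-vi / dvi for vi, dvi in zip(v, dv) if dvi < 0]
    return min(ratios) if ratios else math.inf

# Hypothetical data with m = 1 dependent and 3 independent variables.
r = [2.0, -1.0, 0.5]          # reduced gradient
z = [0.0, 0.0, 3.0]           # independent variables
y = [4.0]                     # dependent variable
BinvC = [[1.0, 2.0, -1.0]]

dz = direction(r, z)          # first component is held at zero
dy = dependent_move(BinvC, dz)
alpha1 = max_step(y, dy)      # limit imposed by y >= 0
alpha2 = max_step(z, dz)      # limit imposed by z >= 0
print(dz, dy, alpha1, alpha2)
```

If the line search stops at the first limit, a dependent variable has reached zero and the partition must be changed by a pivot, as Step 4 describes.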

Example.  We  consider  the  example  presented  in  Sect.  12.4  where  the  projected 
negative  gradient  was  computed: 

$$\begin{aligned}
\text{minimize} \quad & x_1^2 + x_2^2 + x_3^2 + x_4^2 - 2x_1 - 3x_4 \\
\text{subject to} \quad & 2x_1 + x_2 + x_3 + 4x_4 = 7 \\
& x_1 + x_2 + 2x_3 + x_4 = 6 \\
& x_i \geq 0, \quad i = 1, 2, 3, 4.
\end{aligned}$$

We are given the feasible point $\mathbf{x} = (2, 2, 1, 0)$. We may select any two of the strictly positive variables to be the basic variables. Suppose $\mathbf{y} = (x_1, x_2)$ is selected. In standard form the constraints are then

$$\begin{aligned}
x_1 + 0 - x_3 + 3x_4 &= 1 \\
0 + x_2 + 3x_3 - 2x_4 &= 5 \\
x_i \geq 0, \quad i &= 1, 2, 3, 4.
\end{aligned}$$

The gradient at the current point is $\mathbf{g} = (2, 4, 2, -3)$. The corresponding reduced gradient (with respect to $\mathbf{z} = (x_3, x_4)$) is then found by pricing-out in the usual manner. The situation at the current point can then be summarized by the tableau


Tableau for Example

Variable          x1     x2     x3     x4
Constraints        1      0     -1      3   |   1
                   0      1      3     -2   |   5
r^T                0      0     -8     -1
Current value      2      2      1      0

In this solution $x_3$ and $x_4$ would be increased together in a ratio of eight to one. As they increase, $x_1$ and $x_2$ would follow in such a way as to keep the constraints satisfied. Overall, in $E^4$, the implied direction of movement is thus

$$\mathbf{d} = (5, -22, 8, 1).$$

If  the  reader  carefully  supplies  the  computational  details  not  shown  in  the  presenta¬ 
tion  of  the  example  as  worked  here  and  in  Sect.  12.4,  he  will  undoubtedly  develop  a 
considerable  appreciation  for  the  relative  simplicity  of  the  reduced  gradient  method. 
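
The computational details of the example can be reproduced with exact rational arithmetic. The following is a minimal sketch for this 2 x 2 basis only; the helper names `inverse_2x2` and `matmul` are hypothetical and not part of any package.

```python
from fractions import Fraction as F

# Data of the example: constraint matrix and gradient at x = (2, 2, 1, 0).
A = [[F(2), F(1), F(1), F(4)],
     [F(1), F(1), F(2), F(1)]]
g = [F(2), F(4), F(2), F(-3)]
basic, nonbasic = [0, 1], [2, 3]      # y = (x1, x2), z = (x3, x4)

def inverse_2x2(M):
    """Inverse of a nonsingular 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(P, Q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

B = [[row[j] for j in basic] for row in A]
C = [[row[j] for j in nonbasic] for row in A]
BinvC = matmul(inverse_2x2(B), C)     # tableau columns over x3, x4

# Pricing out: r^T = g_z - g_y B^{-1} C
g_y = [g[j] for j in basic]
g_z = [g[j] for j in nonbasic]
r = [gz - sum(gy * BinvC[i][k] for i, gy in enumerate(g_y))
     for k, gz in enumerate(g_z)]

# Both components of r are negative here, so dz = -r; then dy = -B^{-1} C dz.
dz = [-rk for rk in r]
dy = [-sum(BinvC[i][k] * dz[k] for k in range(2)) for i in range(2)]
d = dy + dz
print(BinvC, r, d)
```

The rows of `BinvC` are the constraint rows of the tableau over the independent variables, `r` is its reduced gradient row, and `d` is the direction of movement in the full space.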

It  should  be  clear  that  the  reduced  gradient  method  can,  as  illustrated  in  the  ex¬ 
ample  above,  be  executed  with  the  aid  of  a  tableau.  At  each  step  the  tableau  of 
constraints  is  arranged  so  that  an  identity  matrix  appears  over  the  m  dependent  vari¬ 
ables,  and  thus  the  dependent  variables  can  be  easily  calculated  from  the  values  of 
the independent variables. The reduced gradient at any step is calculated by evaluating the $n$-dimensional gradient and “pricing out” the dependent variables just as the reduced cost vector is calculated in linear programming. And when the partition of
basic  and  non-basic  variables  must  be  changed,  a  simple  pivot  operation  is  all  that 
is  required. 


Global  Convergence 

The  perceptive  reader  will  note  the  direction  finding  algorithm  that  results  from  the 
second  form  of  the  reduced  gradient  method  is  not  closed,  since  slight  movement 
away  from  the  boundary  of  an  inequality  constraint  can  cause  a  sudden  change  in 
the direction of search. Thus one might suspect, and correctly so, that this method is subject to jamming. However, a trivial modification will yield a closed mapping, and hence global convergence. This is discussed in Exercise 19.


Nonlinear Constraints

The  generalized  reduced  gradient  method  solves  nonlinear  programming  problems 
in  the  standard  form 


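$$\begin{aligned}
\text{minimize} \quad & f(\mathbf{x}) \\
\text{subject to} \quad & \mathbf{h}(\mathbf{x}) = \mathbf{0}, \quad \mathbf{a} \leq \mathbf{x} \leq \mathbf{b},
\end{aligned}$$

where $\mathbf{h}(\mathbf{x})$ is of dimension $m$. A general nonlinear programming problem can always be expressed in this form by the introduction of slack variables, if required, and by allowing some components of $\mathbf{a}$ and $\mathbf{b}$ to take on the values $+\infty$ or $-\infty$, if necessary.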

In  a  manner  quite  analogous  to  that  of  the  case  of  linear  constraints,  we  introduce 
a  nondegeneracy  assumption  that,  at  each  point  x,  hypothesizes  the  existence  of  a 
partition  of  x  into  x  =  (y,  z)  having  the  following  properties: 

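(i) $\mathbf{y}$ is of dimension $m$, and $\mathbf{z}$ is of dimension $n - m$.

(ii) If $\mathbf{a} = (\mathbf{a}_y, \mathbf{a}_z)$ and $\mathbf{b} = (\mathbf{b}_y, \mathbf{b}_z)$ are the corresponding partitions of $\mathbf{a}$, $\mathbf{b}$, then $\mathbf{a}_y < \mathbf{y} < \mathbf{b}_y$.

(iii) The $m \times m$ matrix $\nabla_{\mathbf{y}}\mathbf{h}(\mathbf{y}, \mathbf{z})$ is nonsingular at $\mathbf{x} = (\mathbf{y}, \mathbf{z})$.

Again $\mathbf{y}$ and $\mathbf{z}$ are referred to as the vectors of dependent and independent variables, respectively.

The reduced gradient (with respect to $\mathbf{z}$) is in this case:

$$\mathbf{r}^T = \nabla_{\mathbf{z}} f(\mathbf{y}, \mathbf{z}) + \boldsymbol{\lambda}^T\nabla_{\mathbf{z}}\mathbf{h}(\mathbf{y}, \mathbf{z}),$$

where $\boldsymbol{\lambda}$ satisfies

$$\nabla_{\mathbf{y}} f(\mathbf{y}, \mathbf{z}) + \boldsymbol{\lambda}^T\nabla_{\mathbf{y}}\mathbf{h}(\mathbf{y}, \mathbf{z}) = \mathbf{0}.$$

Equivalently, we have

$$\mathbf{r}^T = \nabla_{\mathbf{z}} f(\mathbf{y}, \mathbf{z}) - \nabla_{\mathbf{y}} f(\mathbf{y}, \mathbf{z})[\nabla_{\mathbf{y}}\mathbf{h}(\mathbf{y}, \mathbf{z})]^{-1}\nabla_{\mathbf{z}}\mathbf{h}(\mathbf{y}, \mathbf{z}). \tag{12.63}$$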
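
The passage from the multiplier form to (12.63) can be spelled out as follows. This only restates the two equations above and uses the nonsingularity assumption (iii) to solve for the multiplier:

$$\boldsymbol{\lambda}^T = -\nabla_{\mathbf{y}} f\,[\nabla_{\mathbf{y}}\mathbf{h}]^{-1}
\quad\Longrightarrow\quad
\mathbf{r}^T = \nabla_{\mathbf{z}} f + \boldsymbol{\lambda}^T\nabla_{\mathbf{z}}\mathbf{h}
= \nabla_{\mathbf{z}} f - \nabla_{\mathbf{y}} f\,[\nabla_{\mathbf{y}}\mathbf{h}]^{-1}\nabla_{\mathbf{z}}\mathbf{h}.$$

For linear constraints $\mathbf{h}(\mathbf{y}, \mathbf{z}) = \mathbf{B}\mathbf{y} + \mathbf{C}\mathbf{z} - \mathbf{b}$ we have $\nabla_{\mathbf{y}}\mathbf{h} = \mathbf{B}$ and $\nabla_{\mathbf{z}}\mathbf{h} = \mathbf{C}$, so the expression reduces to (12.62).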

The  actual  procedure  is  roughly  the  same  as  for  linear  constraints  in  that  moves 
are  taken  by  changing  z  in  the  direction  of  the  negative  reduced  gradient  (with 
components  of  z  on  their  boundary  held  fixed  if  the  movement  would  violate  the 
bound).  The  difference  here  is  that  although  z  moves  along  a  straight  line  as  before, 
the  vector  of  dependent  variables  y  must  move  nonlinearly  to  continuously  satisfy 
the equality constraints. Computationally, this is accomplished by first moving linearly along the tangent to the surface defined by $\mathbf{z} \to \mathbf{z} + \Delta\mathbf{z}$, $\mathbf{y} \to \mathbf{y} + \Delta\mathbf{y}$ with $\Delta\mathbf{y} = -[\nabla_{\mathbf{y}}\mathbf{h}]^{-1}\nabla_{\mathbf{z}}\mathbf{h}\,\Delta\mathbf{z}$. Then a correction procedure, much like that employed in the gradient projection method, is used to return to the constraint surface and the magnitude bounds on the dependent variables are checked for feasibility. As with the gradient projection method, a feasibility tolerance must be introduced to acknowledge the impossibility of returning exactly to the constraint surface. An example corresponding to $n = 3$, $m = 1$, $\mathbf{a} = \mathbf{0}$, $\mathbf{b} = +\infty$ is shown in Fig. 12.8.

To return to the surface once a tentative move along the tangent is made, an iterative scheme is employed. If the point $\mathbf{x}_k$ was the point at the previous step, then from any point $\mathbf{x} = (\mathbf{v}, \mathbf{w})$ near $\mathbf{x}_k$ one gets back to the constraint surface by solving the nonlinear equation

$$\mathbf{h}(\mathbf{y}, \mathbf{w}) = \mathbf{0} \tag{12.64}$$

for $\mathbf{y}$ (with $\mathbf{w}$ fixed). This is accomplished through the iterative process

$$\mathbf{y}_{j+1} = \mathbf{y}_j - [\nabla_{\mathbf{y}}\mathbf{h}(\mathbf{x}_k)]^{-1}\mathbf{h}(\mathbf{y}_j, \mathbf{w}), \tag{12.65}$$

which, if started close enough to $\mathbf{x}_k$, will produce $\{\mathbf{y}_j\}$ with $\mathbf{y}_j \to \mathbf{y}$, solving (12.64).
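
The tangent move followed by the correction iteration (12.65) can be illustrated on a toy problem. The sketch below assumes a single hypothetical constraint, the unit circle $h(y, z) = y^2 + z^2 - 1$ with one dependent and one independent variable; the function names, tolerance, and iteration limit are illustrative choices, not part of the method as stated.

```python
import math

def h(y, z):
    """Hypothetical single constraint (m = 1): the unit circle."""
    return y * y + z * z - 1.0

def dh_dy(y, z):
    return 2.0 * y

def dh_dz(y, z):
    return 2.0 * z

def tangent_then_correct(y_k, z_k, dz, tol=1e-12, max_iter=100):
    """Move z by dz, predict y along the tangent, then iterate (12.65)."""
    J = dh_dy(y_k, z_k)                   # derivative frozen at the previous point x_k
    w = z_k + dz                          # new value of the independent variable
    y = y_k - dh_dz(y_k, z_k) / J * dz    # tangent prediction y_k + dy
    for _ in range(max_iter):
        residual = h(y, w)
        if abs(residual) <= tol:          # feasibility tolerance
            break
        y -= residual / J                 # y_{j+1} = y_j - J^{-1} h(y_j, w)
    return y, w

y_new, w_new = tangent_then_correct(0.8, 0.6, 0.1)
print(y_new, w_new, h(y_new, w_new))
```

Because the derivative is evaluated once at the previous point rather than updated at every pass, the correction converges linearly rather than quadratically, but each pass costs only one constraint evaluation.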

The  reduced  gradient  method  suffers  from  the  same  basic  difficulties  as  the  gradi¬ 
ent  projection  method,  but  as  with  the  latter  method,  these  difficulties  can  all  be  more 
or  less  successfully  resolved.  Computation  is  somewhat  less  complex  in  the  case  of 
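the reduced gradient method, because rather than compute with $[\nabla\mathbf{h}(\mathbf{x})\nabla\mathbf{h}(\mathbf{x})^T]^{-1}$ at each step, the matrix $[\nabla_{\mathbf{y}}\mathbf{h}(\mathbf{y}, \mathbf{z})]^{-1}$ is used.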


Fig.  12.8  Reduced  gradient  method 


12.7  Convergence  Rate  of  the  Reduced  Gradient  Method 


As  argued  before,  for  purposes  of  analyzing  the  rate  of  convergence,  it  is  sufficient 
to  consider  the  problem  having  only  equality  constraints 

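$$\begin{aligned}
\text{minimize} \quad & f(\mathbf{x}) \\
\text{subject to} \quad & \mathbf{h}(\mathbf{x}) = \mathbf{0}.
\end{aligned} \tag{12.66}$$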

We then regard the problem as being defined over a surface $\Omega$ of dimension $n - m$. At this point it is again timely to consider the view of our bug, who lives on this constraint surface. Invariably, he continues to regard the problem as extremely elementary, and indeed would have little appreciation for the complexity that seems to face us. To him the problem is an unconstrained problem in $n - m$ dimensions, not, as we see it, a constrained problem in $n$ dimensions. The bug will tenaciously hold
to  the  method  of  steepest  descent.  We  can  emulate  him  provided  that  we  know  how 
he  measures  distance  on  his  surface  and  thus  how  he  calculates  gradients  and  what 
he  considers  to  be  straight  lines. 

Rather  than  imagine  that  the  measure  of  distance  on  his  surface  is  the  one  that 
would  be  inherited  from  us  in  n  dimensions,  as  we  did  when  studying  the  gradient 
projection  method,  we,  in  this  instance,  follow  the  construction  shown  in  Fig.  12.9. 
In our $n$-dimensional space, $n - m$ coordinates are selected as independent variables in such a way that, given their values, the values of the remaining (dependent)
variables  are  determined  by  the  surface.  There  is  already  a  coordinate  system  in  the 
space  of  independent  variables,  and  it  can  be  used  on  the  surface  by  projecting  it  par¬ 
allel  to  the  space  of  the  remaining  dependent  variables.  Thus,  an  arc  on  the  surface  is 
considered  to  be  straight  if  its  projection  onto  the  space  of  independent  variables  is  a 
segment  of  a  straight  line.  With  this  method  for  inducing  a  geometry  on  the  surface, 
the  bug’s  notion  of  steepest  descent  exactly  coincides  with  an  idealized  version  of 
the  reduced  gradient  method. 


Fig.  12.9  Induced  coordinate  system 


In the idealized version of the reduced gradient method for solving (12.66), the vector $\mathbf{x}$ is partitioned as $\mathbf{x} = (\mathbf{y}, \mathbf{z})$ where $\mathbf{y} \in E^m$, $\mathbf{z} \in E^{n-m}$. It is assumed that the $m \times m$ matrix $\nabla_{\mathbf{y}}\mathbf{h}(\mathbf{y}, \mathbf{z})$ is nonsingular throughout a given region of interest. (With respect to the more general problem, this region is a small neighborhood around the solution where it is not necessary to change the partition.) The vector $\mathbf{y}$ is regarded as an implicit function of $\mathbf{z}$ through the equation

$$\mathbf{h}(\mathbf{y}(\mathbf{z}), \mathbf{z}) = \mathbf{0}. \tag{12.67}$$

The ordinary method of steepest descent is then applied to the function $q(\mathbf{z}) = f(\mathbf{y}(\mathbf{z}), \mathbf{z})$. We note that the gradient $\mathbf{r}^T$ of this function is given by (12.63).

Since  the  method  is  really  just  the  ordinary  method  of  steepest  descent  with  re¬ 
spect  to  z,  the  rate  of  convergence  is  determined  by  the  eigenvalues  of  the  Hessian 
of  the  function  q  at  the  solution.  We  therefore  turn  to  the  question  of  evaluating  this 
Hessian. 
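
Denote by $\mathbf{Y}(\mathbf{z})$ the first derivatives of the implicit function $\mathbf{y}(\mathbf{z})$, that is, $\mathbf{Y}(\mathbf{z}) = \nabla_{\mathbf{z}}\mathbf{y}(\mathbf{z})$. Explicitly,

$$\mathbf{Y}(\mathbf{z}) = -[\nabla_{\mathbf{y}}\mathbf{h}(\mathbf{y}, \mathbf{z})]^{-1}\nabla_{\mathbf{z}}\mathbf{h}(\mathbf{y}, \mathbf{z}). \tag{12.68}$$

For any $\boldsymbol{\lambda} \in E^m$ we have

$$q(\mathbf{z}) = f(\mathbf{y}(\mathbf{z}), \mathbf{z}) = f(\mathbf{y}(\mathbf{z}), \mathbf{z}) + \boldsymbol{\lambda}^T\mathbf{h}(\mathbf{y}(\mathbf{z}), \mathbf{z}). \tag{12.69}$$

Thus

$$\nabla q(\mathbf{z}) = [\nabla_{\mathbf{y}} f(\mathbf{y}, \mathbf{z}) + \boldsymbol{\lambda}^T\nabla_{\mathbf{y}}\mathbf{h}(\mathbf{y}, \mathbf{z})]\mathbf{Y}(\mathbf{z}) + \nabla_{\mathbf{z}} f(\mathbf{y}, \mathbf{z}) + \boldsymbol{\lambda}^T\nabla_{\mathbf{z}}\mathbf{h}(\mathbf{y}, \mathbf{z}). \tag{12.70}$$

Now if at a given point $\mathbf{x}^* = (\mathbf{y}^*, \mathbf{z}^*) = (\mathbf{y}(\mathbf{z}^*), \mathbf{z}^*)$, we let $\boldsymbol{\lambda}$ satisfy

$$\nabla_{\mathbf{y}} f(\mathbf{y}^*, \mathbf{z}^*) + \boldsymbol{\lambda}^T\nabla_{\mathbf{y}}\mathbf{h}(\mathbf{y}^*, \mathbf{z}^*) = \mathbf{0}; \tag{12.71}$$

then introducing the Lagrangian $l(\mathbf{y}, \mathbf{z}, \boldsymbol{\lambda}) = f(\mathbf{y}, \mathbf{z}) + \boldsymbol{\lambda}^T\mathbf{h}(\mathbf{y}, \mathbf{z})$, we obtain by differentiating (12.70)

$$\begin{aligned}
\nabla^2 q(\mathbf{z}^*) = {} & \mathbf{Y}(\mathbf{z}^*)^T\nabla^2_{\mathbf{yy}} l(\mathbf{y}^*, \mathbf{z}^*)\mathbf{Y}(\mathbf{z}^*) + \nabla^2_{\mathbf{zy}} l(\mathbf{y}^*, \mathbf{z}^*)\mathbf{Y}(\mathbf{z}^*) \\
& + \mathbf{Y}(\mathbf{z}^*)^T\nabla^2_{\mathbf{yz}} l(\mathbf{y}^*, \mathbf{z}^*) + \nabla^2_{\mathbf{zz}} l(\mathbf{y}^*, \mathbf{z}^*).
\end{aligned} \tag{12.72}$$

Or defining the $n \times (n - m)$ matrix

$$\mathbf{C} = \begin{bmatrix} \mathbf{Y}(\mathbf{z}^*) \\ \mathbf{I} \end{bmatrix}, \tag{12.73}$$

where $\mathbf{I}$ is the $(n - m) \times (n - m)$ identity, we have

$$\mathbf{Q} = \nabla^2 q(\mathbf{z}^*) = \mathbf{C}^T\mathbf{L}(\mathbf{x}^*)\mathbf{C}. \tag{12.74}$$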


The matrix $\mathbf{L}(\mathbf{x}^*)$ is the $n \times n$ Hessian of the Lagrangian at $\mathbf{x}^*$, and $\nabla^2 q(\mathbf{z}^*)$ is an $(n - m) \times (n - m)$ matrix that is a restriction of $\mathbf{L}(\mathbf{x}^*)$ to the tangent subspace $M$, but it is not the usual restriction. We summarize our conclusion with the following theorem.


Theorem. Let $\mathbf{x}^*$ be a local solution of problem (12.66). Suppose that the idealized reduced gradient method produces a sequence $\{\mathbf{x}_k\}$ converging to $\mathbf{x}^*$ and that the partition $\mathbf{x} = (\mathbf{y}, \mathbf{z})$ is used throughout the tail of the sequence. Let $\mathbf{L}$ be the Hessian of the Lagrangian at $\mathbf{x}^*$ and define the matrix $\mathbf{C}$ by (12.73) and (12.68). Then the sequence of objective values $\{f(\mathbf{x}_k)\}$ converges to $f(\mathbf{x}^*)$ linearly with a ratio no greater than $[(B - b)/(B + b)]^2$ where $b$ and $B$ are, respectively, the smallest and largest eigenvalues of the matrix $\mathbf{Q} = \mathbf{C}^T\mathbf{L}\mathbf{C}$.

To compare the matrix $\mathbf{C}^T\mathbf{L}\mathbf{C}$ with the usual restriction of $\mathbf{L}$ to $M$ that determines the convergence rate of most methods, we note that the $n \times (n - m)$ matrix $\mathbf{C}$ maps $\Delta\mathbf{z} \in E^{n-m}$ into $(\Delta\mathbf{y}, \Delta\mathbf{z}) \in E^n$ lying in the tangent subspace $M$; that is, $\nabla_{\mathbf{y}}\mathbf{h}\,\Delta\mathbf{y} + \nabla_{\mathbf{z}}\mathbf{h}\,\Delta\mathbf{z} = \mathbf{0}$. Thus the columns of $\mathbf{C}$ form a basis for the subspace $M$. Next note that the columns of the matrix

$$\mathbf{E} = \mathbf{C}(\mathbf{C}^T\mathbf{C})^{-1/2} \tag{12.75}$$

form an orthonormal basis for $M$, since each column of $\mathbf{E}$ is just a linear combination of columns of $\mathbf{C}$ and by direct calculation we see that $\mathbf{E}^T\mathbf{E} = \mathbf{I}$. Thus by the procedure described in Sect. 11.6 we see that a representation for the usual restriction of $\mathbf{L}$ to $M$ is

$$\mathbf{L}_M = (\mathbf{C}^T\mathbf{C})^{-1/2}\mathbf{C}^T\mathbf{L}\mathbf{C}(\mathbf{C}^T\mathbf{C})^{-1/2}. \tag{12.76}$$

Comparing (12.76) with (12.74) we deduce that

$$\mathbf{Q} = (\mathbf{C}^T\mathbf{C})^{1/2}\mathbf{L}_M(\mathbf{C}^T\mathbf{C})^{1/2}. \tag{12.77}$$

This means that the Hessian matrix for the reduced gradient method is the restriction of $\mathbf{L}$ to $M$ but pre- and post-multiplied by a positive definite symmetric matrix.

The eigenvalues of $\mathbf{Q}$ depend on the exact nature of $\mathbf{C}$ as well as $\mathbf{L}_M$. Thus, the
rate  of  convergence  of  the  reduced  gradient  method  is  not  coordinate  independent 
but  depends  strongly  on  just  which  variables  are  declared  as  independent  at  the  final 
stage  of  the  process.  The  convergence  rate  can  be  either  faster  or  slower  than  that 
of  the  gradient  projection  method.  In  general,  however,  if  C  is  well-behaved  (that 
is,  well-conditioned),  the  ratio  of  eigenvalues  for  the  reduced  gradient  method  can 
be  expected  to  be  the  same  order  of  magnitude  as  that  of  the  gradient  projection 
method.  If,  however,  C  should  be  ill-conditioned,  as  would  arise  in  the  case  where 
the  implicit  equation  h(y,  z)  =  0  is  itself  ill-conditioned,  then  it  can  be  shown  that 
the  eigenvalue  ratio  for  the  reduced  gradient  method  will  most  likely  be  considerably 
worsened.  This  suggests  that  care  should  be  taken  to  select  a  set  of  basic  variables  y 
that  leads  to  a  well-behaved  C  matrix. 
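
The influence of the partition can be seen in a small numerical sketch. It assumes a hypothetical case with n = 3, m = 1 and the Hessian of the Lagrangian equal to the identity, so that the usual restriction of L to M has all eigenvalues equal to one and any eigenvalue spread in Q comes entirely from C. The function names and the two sample matrices Y are illustrative only.

```python
import math

def reduced_hessian(Y, L):
    """Q = C^T L C with C = [Y; I]; Y is m x k and L is n x n, as lists of rows."""
    m, k = len(Y), len(Y[0])
    n = m + k
    # Rows of C: the m rows of Y followed by the k x k identity.
    C = [row[:] for row in Y] + [[1.0 if i == j else 0.0 for j in range(k)]
                                 for i in range(k)]
    LC = [[sum(L[i][p] * C[p][j] for p in range(n)) for j in range(k)]
          for i in range(n)]
    return [[sum(C[p][i] * LC[p][j] for p in range(n)) for j in range(k)]
            for i in range(k)]

def eig_sym_2x2(Q):
    """Eigenvalues (smallest, largest) of a symmetric 2 x 2 matrix."""
    tr = Q[0][0] + Q[1][1]
    det = Q[0][0] * Q[1][1] - Q[0][1] * Q[1][0]
    disc = math.sqrt(tr * tr / 4.0 - det)
    return tr / 2.0 - disc, tr / 2.0 + disc

identity3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
for Y in ([[0.1, 0.2]], [[3.0, 4.0]]):
    b, B = eig_sym_2x2(reduced_hessian(Y, identity3))
    print(Y, "eigenvalue ratio:", B / b, "bound:", ((B - b) / (B + b)) ** 2)
```

With small entries in Y the eigenvalue ratio of Q stays close to one, while large entries, which correspond to dependent variables that react strongly to changes in the independent ones, spread the eigenvalues apart and worsen the bound of the theorem.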

Example (The Hanging Chain Problem). Consider again the hanging chain problem discussed in Sect. 11.4. This problem can be used to illustrate a wide assortment
of  theoretical  principles  and  practical  techniques.  Indeed,  a  study  of  this  example 
clearly  reveals  the  predictive  power  that  can  be  derived  from  an  interplay  of  theory 
and  physical  intuition. 

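The problem is

$$\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{n} (n - i + 0.5)\,y_i \\
\text{subject to} \quad & \sum_{i=1}^{n} y_i = 0 \\
& \sum_{i=1}^{n} \sqrt{1 - y_i^2} = 16,
\end{aligned}$$

where in the original formulation $n = 20$.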

This problem has been solved numerically by the reduced gradient method.* An initial feasible solution was the triangular shape shown in Fig. 12.10a with

$$y_i = \begin{cases} -0.6, & 1 \leq i \leq 10 \\ \phantom{-}0.6, & 11 \leq i \leq 20. \end{cases}$$

* The exact solution is obviously symmetric about the center of the chain, and hence the problem could be reduced to having ten links and only one constraint. However, this symmetry disappears if the first constraint value is specified as nonzero. Therefore for generality we solve the full chain problem.
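
As a quick check, the triangular starting shape can be substituted into the objective and the two constraints. This is a minimal sketch using only the data stated above; the variable names are illustrative.

```python
import math

n = 20
# Triangular starting shape: first ten links slope down, last ten slope up.
y = [-0.6 if i <= 10 else 0.6 for i in range(1, n + 1)]

objective = sum((n - i + 0.5) * y[i - 1] for i in range(1, n + 1))
height_sum = sum(y)                                  # first constraint, should be 0
span = sum(math.sqrt(1.0 - yi * yi) for yi in y)     # second constraint, should be 16

print(objective, height_sum, span)
```

The objective evaluates to -60 up to rounding, which is the value listed for iteration 0 in Table 12.1, and both constraints hold.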


Table 12.1 Results of original chain problem

Iteration    Value         Solution (1/2 of chain)
    0        -60.00000     y1  = -0.8148260
   10        -66.47610     y2  = -0.7826505
   20        -66.52180     y3  = -0.7429208
   30        -66.53595     y4  = -0.6930959
   40        -66.54154     y5  = -0.6310976
   50        -66.54537     y6  = -0.5541078
   60        -66.54628     y7  = -0.4597160
   69        -66.54659     y8  = -0.3468334
   70        -66.54659     y9  = -0.2169879
Lagrange multipliers -9.993817, -6.763148    y10 = -0.07492541

The  results  obtained  from  a  reduced  gradient  package  are  shown  in  Table  12.1. 
Note  that  convergence  is  obtained  in  approximately  70  iterations. 

The  Lagrange  multipliers  of  the  constraints  are  a  by-product  of  the  solution. 
These  can  be  used  to  estimate  the  change  in  solution  value  if  the  constraint  values 
are  changed  slightly.  For  example,  suppose  we  wish  to  estimate,  without  resolving 
the  problem,  the  change  in  potential  energy  (the  objective  function)  that  would  re¬ 
sult  if  the  separation  between  the  two  supports  were  increased  by,  say,  one  inch.  The 
change can be estimated by the formula $\Delta u = -\lambda_2/12 = 0.0833 \times (6.76) = 0.563$.
(When  solved  again  numerically  the  change  is  found  to  be  0.568.) 
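
The sensitivity estimate is a one-line calculation. The sketch below takes the second multiplier from Table 12.1 and assumes, as the factor 1/12 in the formula suggests, that lengths are measured in feet so that one inch is 1/12 of a unit.

```python
lambda_2 = -6.763148       # multiplier of the span constraint, from Table 12.1
delta_span = 1.0 / 12.0    # one inch, assuming lengths are in feet

# First-order estimate of the change in potential energy.
delta_u = -lambda_2 * delta_span
print(round(delta_u, 3))
```

Using the full-precision multiplier rather than the rounded 6.76 changes the estimate only in the third decimal place.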

Let  us  now  pose  some  more  challenging  questions.  Consider  two  variations  of  the 
original  problem.  In  the  first  variation  the  chain  is  replaced  by  one  having  twice  as 
many  links,  but  each  link  is  now  half  the  size  of  the  original  links.  The  overall  chain 
length  is  therefore  the  same  as  before.  In  the  second  variation  the  original  chain  is 
replaced  by  one  having  twice  as  many  links,  but  each  link  is  the  same  size  as  the 
original  links.  The  chain  length  doubles  in  this  case.  If  these  problems  are  solved  by 
the  same  method  as  the  original  problem,  approximately  how  many  iterations  will 
be  required — about  the  same  number,  many  more,  or  substantially  less? 

These  questions  can  be  easily  answered  by  using  the  theory  of  convergence  rates 
developed  in  this  chapter.  The  Hessian  of  the  Lagrangian  is 


$$\mathbf{L} = \mathbf{F} + \lambda_1\mathbf{H}_1 + \lambda_2\mathbf{H}_2.$$

However, since the objective function and the first constraint are both linear, the only nonzero term in the above equation is $\lambda_2\mathbf{H}_2$. Furthermore, since convergence rates depend only on eigenvalue ratios, the $\lambda_2$ can be ignored. Thus the eigenvalues of $\mathbf{H}_2$ determine the canonical convergence rate.


Fig. 12.10 The chain example. (a) Original configuration of chain, (b) Final configuration, (c) Long chain

It is easily seen that $\mathbf{H}_2$ is diagonal with $i$th diagonal term

$$(\mathbf{H}_2)_{ii} = -(1 - y_i^2)^{-3/2},$$

and these values are the eigenvalues of $\mathbf{H}_2$. The canonical convergence rate is defined by the eigenvalues of $\mathbf{H}_2$ in the $(n - 2)$-dimensional tangent subspace $M$. We cannot exactly determine these eigenvalues without a lot of work, but we can assume that they are close to the eigenvalues of $\mathbf{H}_2$. (Indeed, a version of the Interlocking Eigenvalues Lemma states that the $n - 2$ eigenvalues are interlocked with the eigenvalues of $\mathbf{H}_2$.) Then the convergence rate of the gradient projection method will be governed by these eigenvalues. The reduced gradient method will most likely be somewhat slower.

The eigenvalue of smallest absolute value corresponds to the center links, where $y_i \approx 0$. Conversely, the eigenvalue of largest absolute value corresponds to the first or last link, where $y_i$ is largest in absolute value. Thus the relevant eigenvalue ratio is approximately

$$r = \frac{1}{(1 - y_1^2)^{3/2}} = \frac{1}{(\sin\theta)^{3/2}},$$

where $\theta$ is the angle shown in Fig. 12.10b.

For very little effort we have obtained a powerful understanding of the chain problem and its convergence properties. We can use this to answer the questions posed earlier. For the first variation, with twice as many links but each of half size, the angle $\theta$ will be about the same (perhaps a little smaller because of increased flexibility of the chain). Thus the number of iterations should be slightly larger because of the change in $\theta$ and somewhat larger again because there are more variables (which tends to increase the condition number of $\mathbf{C}^T\mathbf{C}$). Note in Table 12.2 that about 122 iterations were required, which is consistent with this estimate.

For the second variation the chain will hang more vertically; hence $y_1$ will be larger, and therefore convergence will be fundamentally slower. To be more specific it is necessary to substitute a few numbers in our simple formula. For the original case we have $y_1 \approx -0.81$. This yields

$$r = (1 - 0.81^2)^{-3/2} = 4.9$$

and a convergence factor of

$$\left(\frac{r - 1}{r + 1}\right)^2 \approx 0.44.$$

This is a modest value and quite consistent with the observed result of 70 iterations for a reduced gradient method. For the long chain we can estimate that $y_1 \approx 0.98$. This yields

$$r = (1 - 0.98^2)^{-3/2} \approx 127$$

and a convergence factor of approximately 0.969.

This last number represents extremely slow convergence. Indeed, since $(0.969)^{25} \approx 0.44$, we expect that it may easily take 25 times as many iterations for the long chain problem to converge as the original problem (although quantitative estimates of this type are rough at best). This again is verified by the results shown in Table 12.2, where it is indicated that over 2,500 iterations were required by a version of the reduced gradient method.

Table 12.2 Results of modified chain problems

      Short links                  Long chain
Iteration    Value           Iteration    Value
    0        -60.00000           0        -366.6061
   10        -66.45499          10        -375.6423
   20        -66.56377          20        -375.9123
   40        -66.58443          50        -376.5128
   60        -66.59191         100        -377.1625
   80        -66.59514         200        -377.8983
  100        -66.59656         500        -378.7989
  120        -66.59825        1000        -379.3012
  121        -66.59827        1500        -379.4994
  122        -66.59827        2000        -379.5965
                              2500        -379.6489
y1 = 0.4109519               y1 = 0.9886223
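
The two estimates can be reproduced numerically. The sketch below uses the rounded values 0.81 and 0.98 quoted in the text; the function names are illustrative, and the last quantity is only the rough comparison of iteration counts described above.

```python
import math

def eigenvalue_ratio(y1):
    """r = (1 - y1^2)^(-3/2), the estimate used in the text."""
    return (1.0 - y1 * y1) ** -1.5

def convergence_factor(r):
    """Steepest descent bound ((r - 1)/(r + 1))^2 for eigenvalue ratio r."""
    return ((r - 1.0) / (r + 1.0)) ** 2

r_orig, r_long = eigenvalue_ratio(0.81), eigenvalue_ratio(0.98)
R_orig, R_long = convergence_factor(r_orig), convergence_factor(r_long)

# Long-chain steps needed to match the reduction of one original-chain step.
steps_per_step = math.log(R_orig) / math.log(R_long)
print(r_orig, R_orig, r_long, R_long, steps_per_step)
```

The logarithm ratio comes out near 25, in line with the comparison of roughly 70 iterations against more than 2,500.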


*12.8  *Variations 

It  is  possible  to  modify  either  the  gradient  projection  method  or  the  reduced  gradient 
method  so  as  to  move  in  directions  that  are  determined  through  additional  consid¬ 
erations.  For  example,  analogs  of  the  conjugate  gradient  method,  PARTAN,  or  any 
of  the  quasi-Newton  methods  can  be  applied  to  constrained  problems  by  handling 
constraints  through  projection  or  reduction.  The  corresponding  asymptotic  rates  of 
convergence  for  such  methods  are  easily  determined  by  applying  the  results  for  un¬ 
constrained  problems  on  the  (n-m)-dimensional  surface  of  constraints,  as  illustrated 
in  this  chapter. 

Although  such  generalizations  can  sometimes  lead  to  substantial  improvement 
in  convergence  rates,  one  must  recognize  that  the  detailed  logic  for  a  complicated 
generalization  can  become  lengthy.  If  the  method  relies  on  the  use  of  an  approxi¬ 
mate  inverse  Hessian  restricted  to  the  constraint  surface,  there  must  be  an  effective 
procedure for updating the approximation when the iterative process progresses
from  one  set  of  active  constraints  to  another.  One  would  also  like  to  insure  that  the 
poor  eigenvalue  structure  sometimes  associated  with  quasi-Newton  methods  does 
not  dominate  the  short-term  convergence  characteristics  of  the  extended  method 
when  the  active  constraint  set  changes.  In  other  words,  one  would  like  to  be  able  to 
achieve  simultaneously  both  superlinear  convergence  and  a  guarantee  of  fast  single 
step  progress.  There  has  been  some  work  in  this  general  area  and  it  appears  to  be 
one  of  potential  promise. 


*Convex  Simplex  Method 

A popular modification of the reduced gradient method, termed the convex simplex method, most closely parallels the highly effective simplex method for solving linear programs. The major difference between this method and the reduced gradient
method  is  that  instead  of  moving  all  (or  several)  of  the  independent  variables  in  the 
direction  of  the  negative  reduced  gradient,  only  one  independent  variable  is  changed 
at  a  time.  The  selection  of  the  one  independent  variable  to  change  is  made  much  as 
in  the  ordinary  simplex  method. 

At  a  given  feasible  point,  let  x  =  (y,  z)  be  the  partition  of  x  into  dependent  and 
independent  parts,  and  assume  for  simplicity  that  the  bounds  on  x  are  x  >  0.  Given 
the  reduced  gradient  rT  at  the  current  point,  the  component  Zi  to  be  changed  is  found 
from: 

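1. Let $r_{i_1} = \min_i \{r_i\}$.

2. Let $r_{i_2} z_{i_2} = \max_i \{r_i z_i\}$.

If $r_{i_1} = r_{i_2} z_{i_2} = 0$, terminate. Otherwise:
If $r_{i_1} \leq -|r_{i_2} z_{i_2}|$, increase $z_{i_1}$.
If $r_{i_1} \geq -|r_{i_2} z_{i_2}|$, decrease $z_{i_2}$.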

The  rule  in  Step  2  amounts  to  selecting  the  variable  that  yields  the  best  potential 
decrease  in  the  cost  function.  The  rule  accounts  for  the  non-negativity  constraint 
on  the  independent  variables  by  weighting  the  cost  coefficients  of  those  variables 
that  are  candidates  to  be  decreased  by  their  distance  from  zero.  This  feature  ensures 
global  convergence  of  the  method. 
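
The selection rule can be written as a short function. This is a sketch under stated assumptions: the function name and return convention are hypothetical, ties between the two tests are resolved in favor of increasing, and the exact comparison with zero would be replaced by a tolerance in practice.

```python
def choose_component(r, z):
    """Pick the independent variable to change by the rule of Steps 1 and 2.

    Returns (action, index), where action is 'increase', 'decrease' or
    'terminate' and index refers to a position in z (None on termination).
    """
    i1 = min(range(len(r)), key=lambda i: r[i])           # smallest r_i
    i2 = max(range(len(r)), key=lambda i: r[i] * z[i])    # largest r_i * z_i
    r1, rz2 = r[i1], r[i2] * z[i2]
    if r1 == 0 and rz2 == 0:
        return "terminate", None
    if r1 <= -abs(rz2):
        return "increase", i1
    return "decrease", i2

print(choose_component([-3.0, 1.0, 2.0], [0.0, 2.0, 0.5]))
print(choose_component([-1.0, 4.0], [1.0, 2.0]))
print(choose_component([0.0, 2.0], [1.0, 0.0]))
```

The weighting by the current value of each variable is what keeps a variable near zero from being selected for a decrease that could produce almost no progress.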

The  remaining  details  of  the  method  are  identical  to  those  of  the  reduced  gradient 
method.  Once  a  particular  component  of  z  is  selected  for  change,  according  to  the 
above  criterion,  the  corresponding  y  vector  is  computed  as  a  function  of  the  change 
in  that  component  so  as  to  continuously  satisfy  the  constraints.  The  component  of  z 
is  continuously  changed  until  either  a  local  minimum  with  respect  to  that  component 
is  attained  or  the  boundary  of  one  nonnegativity  constraint  is  reached. 

Just  as  in  the  discussion  of  the  reduced  gradient  method,  it  is  convenient,  for  pur¬ 
poses  of  convergence  analysis,  to  view  the  problem  as  unconstrained  with  respect 
to  the  independent  variables.  The  convex  simplex  method  is  then  seen  to  be  a  co¬ 
ordinate  descent  procedure  in  the  space  of  these  n  -  m  variables.  Indeed,  since  the 
component  selected  is  based  on  the  magnitude  of  the  components  of  the  reduced 
gradient, the method is merely an adaptation of the Gauss-Southwell scheme discussed in Sect. 8.6 to the constrained situation. Hence, although it is difficult to pin down precisely, we expect that it would take approximately $n - m$ steps of this coordinate descent method to make the progress of a single reduced gradient step. To be competitive with the reduced gradient method, therefore, the difficulties associated with a single step (line searching and constraint evaluation) must be approximately $n - m$ times simpler when only a single component is varied than when all
n  -  m  are  varied  simultaneously.  This  is  indeed  the  case  for  linear  programs  and 
for  some  quadratic  programs  but  not  for  nonlinear  problems  that  require  the  full 
line  search  machinery.  Hence,  in  general,  the  convex  simplex  method  may  not  be  a 
bargain. 


12.9  Summary 

The  concept  of  feasible  direction  methods  is  a  straightforward  and  logical  extension 
of  the  methods  used  for  unconstrained  problems  but  leads  to  some  subtle  difficulties. 
These  methods  are  susceptible  to  jamming  (lack  of  global  convergence)  because 
many  simple  direction  finding  mappings  and  the  usual  line  search  mapping  are  not 
closed. 

Problems with inequality constraints can be approached with an active set
strategy. In this approach certain constraints are treated as active and the others
are treated as inactive. By systematically adding and dropping constraints from the
working set, the correct set of active constraints is determined during the search
process. In general, however, an active set method may require that several
constrained problems be solved exactly.

The most practical primal methods are the gradient projection methods and
the reduced gradient method. Both of these basic methods can be regarded as the
method of steepest descent applied on the surface defined by the active constraints.
The rate of convergence for the two methods can be expected to be approximately
equal and is determined by the eigenvalues of the Hessian of the Lagrangian
restricted to the subspace tangent to the active constraints. Of the two methods,
the reduced gradient method seems to be the better. It can be easily modified to
ensure against jamming, and it requires fewer computations per iterative step;
therefore, for most problems, it will probably converge in less time than the
gradient projection method.


12.10  Exercises 

1. Show that the Frank-Wolfe method is globally convergent if the intersection of
the feasible region and the objective level set $\{x : f(x) \le f(x_0)\}$ is bounded.

2. Sometimes a different normalizing term is used in (12.4). Show that the problem
of finding $d = (d_1, d_2, \ldots, d_n)$ to

$$
\begin{aligned}
&\text{minimize} && c^T d \\
&\text{subject to} && Ad \le 0, \quad \Big(\sum_i |d_i|^p\Big)^{1/p} = 1
\end{aligned}
$$

for $p = 1$ or $p = \infty$ can be converted to a linear program.

3. Perhaps the most natural normalizing term to use in (12.4) is one based on the
Euclidean norm. This leads to the problem of finding $d = (d_1, d_2, \ldots, d_n)$ to

$$
\begin{aligned}
&\text{minimize} && c^T d \\
&\text{subject to} && Ad \le 0, \quad \sum_{i=1}^{n} d_i^2 = 1.
\end{aligned}
$$

Find the Karush-Kuhn-Tucker necessary conditions for this problem and show
how they can be solved by a modification of the simplex procedure.

4. Let $\Omega \subset E^n$ be a given feasible region. A set $\Gamma \subset E^{2n}$
consisting of pairs $(x, d)$, with $x \in \Omega$ and $d$ a feasible direction at $x$,
is said to be a set of uniformly feasible direction vectors if there is a $\delta > 0$
such that $(x, d) \in \Gamma$ implies that $x + \alpha d$ is feasible for all $\alpha$,
$0 \le \alpha \le \delta$. The number $\delta$ is referred to as the feasibility
constant of the set $\Gamma$.

Let $\Gamma \subset E^{2n}$ be a set of uniformly feasible direction vectors for
$\Omega$, with feasibility constant $\delta$. Define the mapping

$$
M_\delta(x, d) = \{y : f(y) \le f(x + \tau d) \text{ for all } \tau,\ 0 \le \tau \le \delta;\
y = x + \alpha d \text{ for some } \alpha,\ 0 \le \alpha \le \infty,\ y \in \Omega\}.
$$

Show that if $d \ne 0$, the map $M_\delta$ is closed at $(x, d)$.

5. Let $\Gamma \subset E^{2n}$ be a set of uniformly feasible direction vectors for
$\Omega$ with feasibility constant $\delta$. For $\varepsilon > 0$ define the map
${}^{\varepsilon}M_\delta$ on $\Gamma$ by

$$
{}^{\varepsilon}M_\delta(x, d) = \{y : f(y) \le f(x + \tau d) + \varepsilon \text{ for all } \tau,\
0 \le \tau \le \delta;\ y = x + \alpha d \text{ for some } \alpha,\ 0 \le \alpha \le \infty,\ y \in \Omega\}.
$$

The map ${}^{\varepsilon}M_\delta$ corresponds to an “inaccurate” constrained line
search. Show that this map is closed if $d \ne 0$.

6. For the problem

$$
\begin{aligned}
&\text{minimize} && f(x) \\
&\text{subject to} && a_i^T x \le b_i, \quad i = 1, 2, \ldots, m
\end{aligned}
$$

consider selecting $d = (d_1, d_2, \ldots, d_n)$ at a feasible point $x$ by solving the
problem

$$
\begin{aligned}
&\text{minimize} && \nabla f(x) d \\
&\text{subject to} && a_i^T d \le (b_i - a_i^T x) M, \quad i = 1, 2, \ldots, m \\
& && \sum_{i=1}^{n} |d_i| = 1,
\end{aligned}
$$

where $M$ is some given positive constant. For large $M$ the $i$th inequality of this
subsidiary problem will be active only if the corresponding inequality in the
original problem is nearly active at $x$ (indeed, note that $M \to \infty$ corresponds
to Zoutendijk’s method). Show that this direction finding mapping is closed and
generates uniformly feasible directions with feasibility constant $1/M$.

7. Generalize the method of Exercise 6 so that it is applicable to nonlinear
inequalities.

8. An alternate, but equivalent, definition of the projected gradient $p$ is that it is
the vector solving

$$
\begin{aligned}
&\text{minimize} && |g - p|^2 \\
&\text{subject to} && A_q p = 0.
\end{aligned}
$$

Using the Karush-Kuhn-Tucker necessary conditions, solve this problem and
thereby derive the formula for the projected gradient.

9. Show that finding the $d$ that solves

$$
\begin{aligned}
&\text{minimize} && g^T d \\
&\text{subject to} && A_q d = 0, \quad |d|^2 = 1
\end{aligned}
$$

gives a vector $d$ that has the same direction as the negative projected gradient.

10. Let $P$ be a projection matrix. Show that $P^T = P$, $P^2 = P$.

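11. Suppose $A_q = \begin{bmatrix} a^T \\ A_{\bar q} \end{bmatrix}$ so that $A_q$ is the
matrix $A_{\bar q}$ with the row $a^T$ adjoined. Show that $(A_q A_q^T)^{-1}$ can be
found from $(A_{\bar q} A_{\bar q}^T)^{-1}$ from the formula

$$
(A_q A_q^T)^{-1} =
\begin{bmatrix}
\xi & -\xi\, a^T A_{\bar q}^T (A_{\bar q} A_{\bar q}^T)^{-1} \\
-\xi\, (A_{\bar q} A_{\bar q}^T)^{-1} A_{\bar q} a &
(A_{\bar q} A_{\bar q}^T)^{-1}\big[I + \xi\, A_{\bar q} a a^T A_{\bar q}^T (A_{\bar q} A_{\bar q}^T)^{-1}\big]
\end{bmatrix},
$$

where

$$
\xi = \frac{1}{a^T a - a^T A_{\bar q}^T (A_{\bar q} A_{\bar q}^T)^{-1} A_{\bar q} a}.
$$

Develop a similar formula for $(A_{\bar q} A_{\bar q}^T)^{-1}$ in terms of
$(A_q A_q^T)^{-1}$.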

12.  Show  that  the  gradient  projection  method  will  solve  a  linear  program  in  a  finite 
number  of  steps. 

13. Suppose that the projected negative gradient $d$ is calculated satisfying

$$
-g = d + A_q^T \lambda
$$

and that some component $\lambda_i$ of $\lambda$, corresponding to an inequality, is
negative. Show that if the $i$th inequality is dropped, the projection $\bar d$ of the
negative gradient onto the remaining constraints is a feasible direction of descent.

14. Using the result of Exercise 13, it is possible to avoid the discontinuity at
$d = 0$ in the direction finding mapping of the simple gradient projection method. At
a given point let $\gamma = -\min\{0, \lambda_i\}$, with the minimum taken with respect
to the indices $i$ corresponding to the active inequalities. The direction to be taken
at this point is $d = -Pg$ if $|Pg| \ge \gamma$, or $\bar d$, defined by dropping the
inequality $i$ for which $\lambda_i = -\gamma$, if $|Pg| \le \gamma$. (In case of
equality either direction is selected.) Show that this direction finding map is closed
over a region where the set of active inequalities does not change.

15. Consider the problem of maximizing entropy discussed in Example 3, Sect. 14.4.
Suppose this problem were solved numerically with two constraints by the gradient
projection method. Derive an estimate for the rate of convergence in terms
of the optimal $p_i$’s.

16.  Find  the  geodesics  of 

(a)  a  two-dimensional  plane 

(b)  a  sphere. 

17. Suppose that the problem

$$
\begin{aligned}
&\text{minimize} && f(x) \\
&\text{subject to} && h(x) = 0
\end{aligned}
$$

is such that every point is a regular point. And suppose that the sequence of
points generated by geodesic descent is bounded. Prove that every limit point of the
sequence satisfies the first-order necessary conditions for a constrained minimum.

18. Show that, for linear constraints, if at some point in the reduced gradient method
$\Delta z$ is zero, that point satisfies the Karush-Kuhn-Tucker first-order necessary
conditions for a constrained minimum.

19. Consider the problem

$$
\begin{aligned}
&\text{minimize} && f(x) \\
&\text{subject to} && Ax = b, \quad x \ge 0,
\end{aligned}
$$

where $A$ is $m \times n$. Assume $f \in C^1$, that the feasible set is bounded, and that
the nondegeneracy assumption holds. Suppose a “modified” reduced gradient
algorithm is defined following the procedure in Sect. 12.6 but with two
modifications: (1) the basic variables are, at the beginning of an iteration, always
taken as the $m$ largest variables (ties are broken arbitrarily); (2) the formula for
$\Delta z$ is replaced by

$$
\Delta z_i =
\begin{cases}
-r_i & \text{if } r_i \le 0 \\
-x_i r_i & \text{if } r_i > 0.
\end{cases}
$$

Establish the global convergence of this algorithm.

20.  Find  the  exact  solution  to  the  example  presented  in  Sect.  12.4. 

21. Find the direction of movement that would be taken by the gradient projection
method if in the example of Sect. 12.4 the constraint $x_4 = 0$ were relaxed. Show
that if the term $-3x_4$ in the objective function were replaced by $-x_4$, then both
the gradient projection method and the reduced gradient method would move in
identical directions.

22. Show that in terms of convergence characteristics, the reduced gradient method
behaves like the gradient projection method applied to a scaled version of the
problem.

23. Let $r$ be the condition number of $L_M$ and $s$ the condition number of $C^T C$.
Show that the rate of convergence of the reduced gradient method is no worse than
$[(sr - 1)/(sr + 1)]^2$.

24. Formulate the symmetric version of the hanging chain problem using a single
constraint. Find an explicit expression for the condition number of the
corresponding $C^T C$ matrix (assuming $y_1$ is basic). Use Exercise 23 to obtain an
estimate of the convergence rate of the reduced gradient method applied to this
problem, and compare it with the rate obtained in Table 12.1, Sect. 12.7. Repeat
for the two-constraint formulation (assuming $y_1$ and $y_n$ are basic).

25. Referring to Exercise 19 establish a global convergence result for the convex
simplex method.


Chapter 13

Penalty  and  Barrier  Methods 


Penalty and barrier methods are procedures for approximating constrained
optimization problems by unconstrained problems. The approximation is accomplished in
the case of penalty methods by adding to the objective function a term that
prescribes a high cost for violation of the constraints, and in the case of barrier
methods by adding a term that favors points interior to the feasible region over those
near the boundary. Associated with these methods is a parameter $c$ or $\mu$ that
determines the severity of the penalty or barrier and consequently the degree to which
the unconstrained problem approximates the original constrained problem. For a problem
with n variables and m constraints, penalty and barrier methods work directly in
the n-dimensional space of variables, as compared to primal methods that work in
(n - m)-dimensional space.

There are two fundamental issues associated with the methods of this chapter.
The first has to do with how well the unconstrained problem approximates the
constrained one. This is essential in examining whether, as the parameter c is
increased toward infinity, the solution of the unconstrained problem converges to a
solution of the constrained problem. The other issue, most important from a practical
viewpoint, is the question of how to solve a given unconstrained problem when its
objective function contains a penalty or barrier term. It turns out that as c is
increased to yield a good approximating problem, the corresponding structure of the
resulting unconstrained problem becomes increasingly unfavorable, thereby slowing the
convergence rate of many algorithms that might be applied. (Exact penalty functions
also have a very unfavorable structure.) It is necessary, then, to devise acceleration
procedures that circumvent this slow convergence phenomenon.

Penalty and barrier methods are of great interest to both the practitioner and
the theorist. To the practitioner they offer a simple, straightforward method for
handling constrained problems that can be implemented without sophisticated
computer programming and that possesses much the same degree of generality as primal
methods. The theorist, striving to make this approach practical by overcoming its
inherently slow convergence, finds it appropriate to bring into play nearly all aspects
of optimization theory, including Lagrange multipliers, necessary conditions, and
many of the algorithms discussed earlier in this book. The canonical rate of
convergence associated with the original constrained problem again asserts its
fundamental role by essentially determining the natural accelerated rate of
convergence for unconstrained penalty or barrier problems.


13.1  Penalty  Methods 

Consider the problem

$$
\begin{aligned}
&\text{minimize} && f(x) \\
&\text{subject to} && x \in S,
\end{aligned}
\tag{13.1}
$$

where $f$ is a continuous function on $E^n$ and $S$ is a constraint set in $E^n$. In most
applications $S$ is defined implicitly by a number of functional constraints, but in
this section the more general description in (13.1) can be handled. The idea of a
penalty function method is to replace problem (13.1) by an unconstrained problem
of the form

$$
\text{minimize} \quad f(x) + cP(x), \tag{13.2}
$$

where $c$ is a positive constant and $P$ is a function on $E^n$ satisfying: (i) $P$ is
continuous, (ii) $P(x) \ge 0$ for all $x \in E^n$, and (iii) $P(x) = 0$ if and only if
$x \in S$.

Example 1. Suppose $S$ is defined by a number of inequality constraints:

$$
S = \{x : g_i(x) \le 0,\ i = 1, 2, \ldots, p\}.
$$

A very useful penalty function in this case is

$$
P(x) = \frac{1}{2} \sum_{i=1}^{p} \big(\max[0, g_i(x)]\big)^2.
$$

The function $cP(x)$ is illustrated in Fig. 13.1 for the one-dimensional case with
$g_1(x) = x - b$, $g_2(x) = a - x$.
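As a small illustration, the penalty function of Example 1 can be evaluated directly. The sketch below is in Python; the endpoint values a = 1 and b = 3 and the helper name `quadratic_penalty` are assumptions made only for this illustration.

```python
def quadratic_penalty(x, constraints):
    """P(x) = 1/2 * sum_i max(0, g_i(x))**2 for constraints written as g_i(x) <= 0."""
    return 0.5 * sum(max(0.0, g(x)) ** 2 for g in constraints)

# Assumed interval endpoints; the constraints follow Example 1:
# g1(x) = x - b and g2(x) = a - x, so S is the interval [a, b].
a, b = 1.0, 3.0
constraints = [lambda x: x - b, lambda x: a - x]

for x in (0.0, 1.0, 2.0, 3.0, 4.5):
    # P vanishes exactly on [a, b] and grows quadratically outside it.
    print(x, quadratic_penalty(x, constraints))
```

The three defining properties of a penalty function can be read off from the output: the values are never negative, they are zero on S, and they are positive at the two infeasible test points.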

For large $c$ it is clear that the minimum point of problem (13.2) will be in a region
where $P$ is small. Thus, for increasing $c$ it is expected that the corresponding
solution points will approach the feasible region $S$ and, subject to being close, will
minimize $f$. Ideally then, as $c \to \infty$ the solution point of the penalty problem
will converge to a solution of the constrained problem.


The  Method 

The procedure for solving problem (13.1) by the penalty function method is this:
Let $\{c_k\}$, $k = 1, 2, \ldots$, be a sequence tending to infinity such that for each
$k$, $c_k \ge 0$, $c_{k+1} > c_k$. Define the function

$$
q(c, x) = f(x) + cP(x). \tag{13.3}
$$

Fig. 13.1 Plot of $cP(x)$

For each $k$ solve the problem

$$
\text{minimize} \quad q(c_k, x), \tag{13.4}
$$

obtaining a solution point $x_k$.

We assume here that, for each $k$, problem (13.4) has a solution. This will be true,
for example, if $q(c, x)$ increases unboundedly as $|x| \to \infty$. (Also see Exercise 2
to see that it is not necessary to obtain the minimum precisely.)
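The procedure can be sketched in a few lines of Python. The test problem (minimize f(x) = x subject to 1 - x ≤ 0, whose solution is x* = 1), the sequence of values of c, the search bracket, and the golden-section routine used as the unconstrained minimizer are all assumptions chosen for this illustration; any unconstrained method could be substituted.

```python
import math

def golden_section_min(fun, lo, hi, tol=1e-10):
    """Minimize a unimodal function of one variable on [lo, hi]."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    x1 = hi - invphi * (hi - lo)
    x2 = lo + invphi * (hi - lo)
    f1, f2 = fun(x1), fun(x2)
    while hi - lo > tol:
        if f1 < f2:                      # minimum lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = hi - invphi * (hi - lo)
            f1 = fun(x1)
        else:                            # minimum lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + invphi * (hi - lo)
            f2 = fun(x2)
    return 0.5 * (lo + hi)

def f(x):
    return x                             # assumed objective

def P(x):
    return 0.5 * max(0.0, 1.0 - x) ** 2  # quadratic penalty for g(x) = 1 - x <= 0

def q(c, x):
    return f(x) + c * P(x)               # the function (13.3)

def penalty_method(c_values, lo=-10.0, hi=10.0):
    """Solve (13.4) for each c_k and record (c_k, x_k, q, P, f)."""
    history = []
    for c in c_values:
        xk = golden_section_min(lambda x: q(c, x), lo, hi)
        history.append((c, xk, q(c, xk), P(xk), f(xk)))
    return history

history = penalty_method([1.0, 10.0, 100.0, 1000.0])
for c, xk, qk, Pk, fk in history:
    print(c, xk, qk, Pk, fk)
```

For this problem the minimizer of q(c, ·) is x = 1 - 1/c, so the printed points approach x* = 1 from the infeasible side, while q and f increase toward the optimal value 1 and P decreases toward zero.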


Convergence 

The following lemma gives a set of inequalities that follow directly from the
definition of $x_k$ and the inequality $c_{k+1} > c_k$.


Lemma 1.

$$
\begin{aligned}
q(c_k, x_k) &\le q(c_{k+1}, x_{k+1}) && (13.5) \\
P(x_k) &\ge P(x_{k+1}) && (13.6) \\
f(x_k) &\le f(x_{k+1}). && (13.7)
\end{aligned}
$$

Proof.

$$
\begin{aligned}
q(c_{k+1}, x_{k+1}) &= f(x_{k+1}) + c_{k+1} P(x_{k+1}) \ge f(x_{k+1}) + c_k P(x_{k+1}) \\
&\ge f(x_k) + c_k P(x_k) = q(c_k, x_k),
\end{aligned}
$$

which proves (13.5).
We also have

$$
\begin{aligned}
f(x_k) + c_k P(x_k) &\le f(x_{k+1}) + c_k P(x_{k+1}) && (13.8) \\
f(x_{k+1}) + c_{k+1} P(x_{k+1}) &\le f(x_k) + c_{k+1} P(x_k). && (13.9)
\end{aligned}
$$

Adding (13.8) and (13.9) yields

$$
(c_{k+1} - c_k) P(x_{k+1}) \le (c_{k+1} - c_k) P(x_k),
$$

which proves (13.6).
Also

$$
f(x_{k+1}) + c_k P(x_{k+1}) \ge f(x_k) + c_k P(x_k),
$$

and hence using (13.6) we obtain (13.7). ∎


Lemma 2. Let $x^*$ be a solution to problem (13.1). Then for each $k$

$$
f(x^*) \ge q(c_k, x_k) \ge f(x_k).
$$

Proof.

$$
f(x^*) = f(x^*) + c_k P(x^*) \ge f(x_k) + c_k P(x_k) \ge f(x_k). \qquad ∎
$$

Global  convergence  of  the  penalty  method,  or  more  precisely  verification  that  any 
limit  point  of  the  sequence  is  a  solution,  follows  easily  from  the  two  lemmas  above. 

Theorem. Let $\{x_k\}$ be a sequence generated by the penalty method. Then, any limit
point of the sequence is a solution to (13.1).

Proof. Suppose the subsequence $\{x_k\}$, $k \in \mathcal{K}$, is a convergent
subsequence of $\{x_k\}$ having limit $\bar{x}$. Then by the continuity of $f$, we have

$$
\lim_{k \in \mathcal{K}} f(x_k) = f(\bar{x}). \tag{13.10}
$$

Let $f^*$ be the optimal value associated with problem (13.1). Then according to
Lemmas 1 and 2, the sequence of values $q(c_k, x_k)$ is nondecreasing and bounded
above by $f^*$. Thus

$$
\lim_{k \in \mathcal{K}} q(c_k, x_k) = q^* \le f^*. \tag{13.11}
$$

Subtracting (13.10) from (13.11) yields

$$
\lim_{k \in \mathcal{K}} c_k P(x_k) = q^* - f(\bar{x}). \tag{13.12}
$$

Since $P(x_k) \ge 0$ and $c_k \to \infty$, (13.12) implies

$$
\lim_{k \in \mathcal{K}} P(x_k) = 0.
$$

Using the continuity of $P$, this implies $P(\bar{x}) = 0$. We therefore have shown that
the limit point $\bar{x}$ is feasible for (13.1).

To show that $\bar{x}$ is optimal we note that from Lemma 2, $f(x_k) \le f^*$ and hence

$$
f(\bar{x}) = \lim_{k \in \mathcal{K}} f(x_k) \le f^*. \qquad ∎
$$


13.2 Barrier Methods


Barrier methods are applicable to problems of the form

$$
\begin{aligned}
&\text{minimize} && f(x) \\
&\text{subject to} && x \in S,
\end{aligned}
\tag{13.13}
$$

where the constraint set $S$ has a nonempty interior that is arbitrarily close to any
point of $S$. Intuitively, what this means is that the set has an interior and it is
possible to get to any boundary point by approaching it from the interior. We shall
refer to such a set as robust. Some examples of robust and nonrobust sets are shown in
Fig. 13.2. This kind of set often arises in conjunction with inequality constraints,
where $S$ takes the form

$$
S = \{x : g_i(x) \le 0,\ i = 1, 2, \ldots, p\}.
$$

Barrier methods are also termed interior methods. They work by establishing a
barrier on the boundary of the feasible region that prevents a search procedure from
leaving the region. A barrier function is a function $B$ defined on the interior of $S$
such that: (i) $B$ is continuous, (ii) $B(x) \ge 0$, (iii) $B(x) \to \infty$ as $x$
approaches the boundary of $S$.

Example 1. Let $g_i$, $i = 1, 2, \ldots, p$ be continuous functions on $E^n$. Suppose

$$
S = \{x : g_i(x) \le 0,\ i = 1, 2, \ldots, p\}
$$

is robust, and suppose the interior of $S$ is the set of $x$’s where $g_i(x) < 0$,
$i = 1, 2, \ldots, p$. Then the function

$$
B(x) = -\sum_{i=1}^{p} \frac{1}{g_i(x)},
$$

defined on the interior of $S$, is a barrier function. It is illustrated in one
dimension for $g_1 = x - a$, $g_2 = x - b$ in Fig. 13.3.


Fig. 13.2 Examples (panels: Robust; Not robust; Not robust)

Example 2. For the same situation as Example 1, we may use the logarithmic utility
function

$$
B(x) = -\sum_{i=1}^{p} \log[-g_i(x)].
$$

This is the barrier function commonly used in linear programming interior point
methods, and it is frequently used more generally as well.
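The two barrier functions of Examples 1 and 2 are easy to compare numerically. In the Python sketch below the feasible set is taken to be the interval [a, b] with a = 1 and b = 3, described by the constraints a - x ≤ 0 and x - b ≤ 0; this choice of constraints and the function names are assumptions made for the illustration only.

```python
import math

def inverse_barrier(x, constraints):
    """B(x) = -sum_i 1 / g_i(x), defined where every g_i(x) < 0."""
    return -sum(1.0 / g(x) for g in constraints)

def log_barrier(x, constraints):
    """B(x) = -sum_i log(-g_i(x)), defined where every g_i(x) < 0."""
    return -sum(math.log(-g(x)) for g in constraints)

# Assumed interval [a, b]: the interior is a < x < b.
a, b = 1.0, 3.0
constraints = [lambda x: a - x, lambda x: x - b]

for x in (2.0, 2.9, 2.99, 2.999):
    # Both barriers grow without bound as x approaches the boundary point b.
    print(x, inverse_barrier(x, constraints), log_barrier(x, constraints))
```

Note that the logarithmic function can take negative values on sets where some -g_i(x) exceeds 1, so requirement (ii) holds for it only up to a shift on a bounded region; the growth near the boundary, requirement (iii), is what matters for the method.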


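Corresponding to the problem (13.13), consider the approximate problem

$$
\begin{aligned}
&\text{minimize} && f(x) + \frac{1}{c} B(x) \\
&\text{subject to} && x \in \text{interior of } S,
\end{aligned}
\tag{13.14}
$$

where $c$ is a positive constant.

Alternatively, it is common to formulate the barrier method as

$$
\begin{aligned}
&\text{minimize} && f(x) + \mu B(x) \\
&\text{subject to} && x \in \text{interior of } S.
\end{aligned}
\tag{13.15}
$$

Fig. 13.3 Barrier function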


When formulated with $c$ we take $c$ large (going to infinity); while when
formulated with $\mu$ we take $\mu$ small (going to zero). Either way the result is a
constrained problem, and indeed the constraint is somewhat more complicated than in
the original problem (13.13). The advantage of this problem, however, is that it can
be solved by using an unconstrained search technique. To find the solution one starts
at an initial interior point and then searches from that point using steepest descent
or some other iterative descent method applicable to unconstrained problems. Since
the value of the objective function approaches infinity near the boundary of $S$, the
search technique (if carefully implemented) will automatically remain within the
interior of $S$, and the constraint need not be accounted for explicitly. Thus, although
problem (13.14) or (13.15) is from a formal viewpoint a constrained problem, from
a computational viewpoint it is unconstrained.


The  Method 


The barrier method is quite analogous to the penalty method. Let $\{c_k\}$ be a sequence
tending to infinity such that for each $k$, $k = 1, 2, \ldots$, $c_k \ge 0$,
$c_{k+1} > c_k$. Define the function

$$
r(c, x) = f(x) + \frac{1}{c} B(x).
$$

For each $k$ solve the problem

$$
\begin{aligned}
&\text{minimize} && r(c_k, x) \\
&\text{subject to} && x \in \text{interior of } S,
\end{aligned}
$$

obtaining the point $x_k$.


Convergence 

Virtually the same convergence properties hold for the barrier method as for the
penalty method. We leave to the reader the proof of the following result.

Theorem. Any limit point of a sequence $\{x_k\}$ generated by the barrier method is a
solution to problem (13.13).
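A minimal Python sketch of the barrier method follows. The test problem (minimize f(x) = x subject to 1 - x ≤ 0, with the logarithmic barrier B(x) = -log(x - 1)), the starting point, and the damped Newton search are assumptions chosen for illustration. The step-halving loop is one simple way of making the search "carefully implemented": a trial point outside the interior is assigned the value infinity and is therefore rejected.

```python
import math

def r(c, x):
    """r(c, x) = f(x) + (1/c) B(x) for f(x) = x, B(x) = -log(x - 1); +inf outside the interior."""
    if x <= 1.0:
        return math.inf
    return x - math.log(x - 1.0) / c

def minimize_interior(c, x0, tol=1e-12, max_iter=200):
    """Damped Newton search on r(c, .) started from the interior point x0."""
    x = x0
    for _ in range(max_iter):
        u = x - 1.0
        grad = 1.0 - 1.0 / (c * u)       # derivative of r with respect to x
        hess = 1.0 / (c * u * u)         # second derivative, positive in the interior
        step = -grad / hess
        t = 1.0
        # Halve the step until the trial point is interior and does not increase r.
        while t > 1e-12 and r(c, x + t * step) > r(c, x):
            t *= 0.5
        x_new = x + t * step
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

x = 3.0                                  # assumed initial interior point
iterates = []
for c in (1.0, 10.0, 100.0, 1000.0):
    x = minimize_interior(c, x)          # warm start from the previous solution
    iterates.append((c, x))
    print(c, x)
```

For this problem the minimizer of r(c, ·) is x = 1 + 1/c, so the printed points approach the solution x* = 1 from inside the feasible region, in contrast with the penalty method.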


13.3  Properties  of  Penalty  and  Barrier  Functions 

Penalty and barrier methods are applicable to nonlinear programming problems
having a very general form of constraint set $S$. In most situations, however, this
set is not given explicitly but is defined implicitly by a number of functional
constraints. In these situations, the penalty or barrier function is invariably defined
in terms of the constraint functions themselves; and although there are an unlimited
number of ways in which this can be done, some important general implications
follow from this kind of construction.

For economy of notation we consider problems of the form

$$
\begin{aligned}
&\text{minimize} && f(x) \\
&\text{subject to} && g_i(x) \le 0, \quad i = 1, 2, \ldots, p.
\end{aligned}
\tag{13.16}
$$


For  our  present  purposes,  equality  constraints  are  suppressed,  at  least  notationally, 
by  writing  each  of  them  as  two  inequalities.  If  the  problem  is  to  be  attacked  with 
a  barrier  method,  then,  of  course,  equality  constraints  are  not  present  even  in  an 
unsuppressed  version. 


Penalty  Functions 

A penalty function for a problem expressed in the form (13.16) will most naturally
be expressed in terms of the auxiliary constraint functions

$$
g_i^+(x) = \max[0, g_i(x)], \quad i = 1, 2, \ldots, p. \tag{13.17}
$$

This is because in the interior of the constraint region $P(x) = 0$ and hence $P$ should
be a function only of violated constraints. Denoting by $g^+(x)$ the $p$-dimensional
vector made up of the $g_i^+(x)$’s, we consider the general class of penalty functions

$$
P(x) = \gamma(g^+(x)), \tag{13.18}
$$

where $\gamma$ is a continuous function from $E^p$ to the real numbers, defined in such
a way that $P$ satisfies the requirements demanded of a penalty function.

Example 1. Set

$$
P(x) = \frac{1}{2} \sum_{i=1}^{p} \big(g_i^+(x)\big)^2 = \frac{1}{2} |g^+(x)|^2,
$$

which is without doubt the most popular penalty function. In this case $\gamma$ is
one-half times the identity quadratic form on $E^p$, that is,
$\gamma(y) = \frac{1}{2}|y|^2$.

Example 2. By letting

$$
\gamma(y) = y^T \Gamma y,
$$

where $\Gamma$ is a symmetric positive definite $p \times p$ matrix, we obtain the
penalty function

$$
P(x) = g^+(x)^T \Gamma g^+(x).
$$

Example 3. A general class of penalty functions is

$$
P(x) = \sum_{i=1}^{p} \big(g_i^+(x)\big)^{\varepsilon}
$$

for some $\varepsilon > 0$.


Lagrange Multipliers


In the penalty method we solve, for various $c_k$, the unconstrained problem

$$
\text{minimize} \quad f(x) + c_k P(x). \tag{13.19}
$$

Most algorithms require that the objective function has continuous first partial
derivatives. Since we shall, as usual, assume that both $f$ and $g \in C^1$, it is
natural to require, then, that the penalty function $P \in C^1$. We define

$$
\nabla g_i^+(x) =
\begin{cases}
\nabla g_i(x) & \text{if } g_i(x) \ge 0 \\
0 & \text{if } g_i(x) < 0
\end{cases}
\tag{13.20}
$$

and, of course, $\nabla g^+(x)$ is the $p \times n$ matrix whose rows are the
$\nabla g_i^+$’s. Unfortunately, $\nabla g^+$ is usually discontinuous at points where
$g_i(x) = 0$ for some $i = 1, 2, \ldots, p$, and thus some restrictions must be placed
on $\gamma$ in order to guarantee $P \in C^1$. We assume that $\gamma \in C^1$ and that if
$y = (y_1, y_2, \ldots, y_p)$, $\nabla\gamma(y) = (\nabla\gamma_1, \nabla\gamma_2, \ldots, \nabla\gamma_p)$,
then

$$
y_i = 0 \quad \text{implies} \quad \nabla\gamma_i = 0. \tag{13.21}
$$

(In Example 3 above, for instance, this condition is satisfied only for
$\varepsilon > 1$.) With this assumption, the derivative of $\gamma(g^+(x))$ with
respect to $x$ is continuous and can be written as $\nabla\gamma(g^+(x))\nabla g(x)$. In
this result $\nabla g(x)$ legitimately replaces the discontinuous $\nabla g^+(x)$,
because it is premultiplied by $\nabla\gamma(g^+(x))$. Of course, these considerations
are necessary only for inequality constraints. If equality constraints are treated
directly, the situation is far simpler.
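Condition (13.21) can be checked numerically for the class of Example 3, where each term has the form γ(y) = y^ε and, for y > 0, derivative ε y^(ε-1). The short Python sketch below (the function name and the sample values are assumptions for illustration) evaluates this derivative as y decreases to zero for three exponents.

```python
def gamma_grad(y, eps):
    """Derivative of gamma(y) = y**eps at a point y > 0."""
    return eps * y ** (eps - 1.0)

for eps in (0.5, 1.0, 2.0):
    # Derivative values as y decreases toward zero from the violated side.
    values = [gamma_grad(y, eps) for y in (1e-2, 1e-4, 1e-6)]
    print(eps, values)
```

Only for the exponent 2 do the values tend to zero, matching the derivative on the feasible side; for the exponent 1 they stay at one and for 0.5 they grow without bound, so the resulting penalty function fails to be continuously differentiable at the boundary.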

In view of this assumption, problem (13.19) will have its solution at a point $x_k$
satisfying

$$
\nabla f(x_k) + c_k \nabla\gamma(g^+(x_k)) \nabla g(x_k) = 0,
$$

which can be written as

$$
\nabla f(x_k) + \lambda_k^T \nabla g(x_k) = 0, \tag{13.22}
$$

where

$$
\lambda_k^T = c_k \nabla\gamma(g^+(x_k)). \tag{13.23}
$$

Thus, associated with every $c$ is a Lagrange multiplier vector that is determined after
the unconstrained minimization is performed.

If a solution $x^*$ to the original problem (13.16) is a regular point of the
constraints, then there is a unique Lagrange multiplier vector $\lambda^*$ associated
with the solution. The result stated below says that $\lambda_k \to \lambda^*$.


Proposition. Suppose that the penalty function method is applied to problem (13.16) using
a penalty function of the form (13.18) with $\gamma \in C^1$ and satisfying (13.21).
Corresponding to the sequence $\{x_k\}$ generated by this method, define
$\lambda_k = c_k \nabla\gamma(g^+(x_k))^T$. If $x_k \to x^*$, a solution to (13.16), and
this solution is a regular point, then $\lambda_k \to \lambda^*$, the Lagrange
multiplier associated with problem (13.16).


Proof left to the reader.

Example 4. For $P(x) = \frac{1}{2}|g^+(x)|^2$ we have $\lambda_k = c_k g^+(x_k)$.

As a final observation we note that in general if $x_k \to x^*$, then since
$\lambda_k = c_k \nabla\gamma(g^+(x_k))^T \to \lambda^*$, the sequence $x_k$ approaches
$x^*$ from outside the constraint region. Indeed, as $x_k$ approaches $x^*$ all
constraints that are active at $x^*$ and have positive Lagrange multipliers will be
violated at $x_k$ because the corresponding components of $\nabla\gamma(g^+(x_k))$ are
positive. Thus, if we assume that the active constraints are nondegenerate (all
Lagrange multipliers are strictly positive), every active constraint will be
approached from the outside.
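The multiplier estimate of Example 4 can be observed on a one-variable problem. The Python sketch below assumes the problem of minimizing f(x) = x² subject to g(x) = 1 - x ≤ 0, for which x* = 1 and λ* = 2. With the quadratic penalty, the function x² + (c/2)(1 - x)² has the closed-form minimizer x = c/(2 + c), which is used here in place of a numerical search; the function names are illustrative.

```python
def penalty_minimizer(c):
    """Minimizer of x**2 + (c/2) * max(0, 1 - x)**2, obtained from 2x - c(1 - x) = 0."""
    return c / (2.0 + c)

def multiplier_estimate(c):
    """lambda_k = c_k * g_plus(x_k), as in Example 4."""
    xk = penalty_minimizer(c)
    g_plus = max(0.0, 1.0 - xk)
    return c * g_plus

for c in (1.0, 10.0, 100.0, 1e4, 1e6):
    xk = penalty_minimizer(c)
    # x_k < 1 for every c: the constraint is violated at each iterate.
    print(c, xk, multiplier_estimate(c))
```

The printed estimates increase toward λ* = 2 while each x_k remains slightly infeasible, which is the approach from outside the constraint region described above.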


The  Hessian  Matrix 

Since the penalty function method must, for various (large) values of $c$, solve the
unconstrained problem

$$
\text{minimize} \quad f(x) + cP(x), \tag{13.24}
$$

it is important, in order to evaluate the difficulty of such a problem, to determine
the eigenvalue structure of the Hessian of this modified objective function. We show
here that the structure becomes increasingly unfavorable as $c$ increases.

Although in this section we require that the function $P \in C^1$, we do not require
that $P \in C^2$. In particular, the most popular penalty function
$P(x) = \frac{1}{2}|g^+(x)|^2$, illustrated in Fig. 13.1 for one component, has a
discontinuity in its second derivative at any point where a component of $g$ is zero.
At first this might appear to be a serious drawback, since it means the Hessian is
discontinuous at the boundary of the constraint region, right where, in general, the
solution is expected to lie. However, as pointed out above, the penalty method
generates points that approach a boundary solution from outside the constraint region.
Thus, except for some possible chance occurrences, the sequence will, as
$x_k \to x^*$, be at points where the Hessian is well-defined. Furthermore, in
iteratively solving the unconstrained problem (13.24) with a fixed $c_k$, a sequence
will be generated that converges to $x_k$, which is (for most values of $k$) a point
where the Hessian is well-defined, and hence the standard type of analysis will be
applicable to the tail of such a sequence.

Defining $q(c, x) = f(x) + c\gamma(g^+(x))$ we have for the Hessian, $Q$, of $q$ (with
respect to $x$)

$$
Q(c, x) = F(x) + c\nabla\gamma(g^+(x))G(x) + c\nabla g^+(x)^T \Gamma(g^+(x)) \nabla g^+(x),
$$

where $F$, $G$, and $\Gamma$ are, respectively, the Hessians of $f$, $g$, and $\gamma$.
For a fixed $c_k$ we use the definition of $\lambda_k$ given by (13.23) and introduce
the rather natural definition

$$
L_k(x_k) = F(x_k) + \lambda_k^T G(x_k), \tag{13.25}
$$

which is the Hessian of the corresponding Lagrangian. Then we have

$$
Q(c_k, x_k) = L_k(x_k) + c_k \nabla g^+(x_k)^T \Gamma(g^+(x_k)) \nabla g^+(x_k), \tag{13.26}
$$

which is the desired expression.

The first term on the right side of (13.26) converges to the Hessian of the
Lagrangian of the original constrained problem as $x_k \to x^*$, and hence has a limit
that is independent of $c_k$. The second term is a matrix having rank equal to the rank
of the active constraints and having a magnitude tending to infinity. (See Exercise 7.)

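Example 5. For $P(x) = \frac{1}{2}|g^+(x)|^2$ we have

$$
\Gamma(g^+(x_k)) =
\begin{bmatrix}
e_1 & 0 & \cdots & 0 \\
0 & e_2 & & \vdots \\
\vdots & & \ddots & 0 \\
0 & \cdots & 0 & e_p
\end{bmatrix},
$$

where

$$
e_i =
\begin{cases}
1 & \text{if } g_i(x_k) > 0 \\
0 & \text{if } g_i(x_k) < 0 \\
\text{undefined} & \text{if } g_i(x_k) = 0.
\end{cases}
$$

Thus

$$
c_k \nabla g^+(x_k)^T \Gamma(g^+(x_k)) \nabla g^+(x_k) = c_k \nabla g^+(x_k)^T \nabla g^+(x_k),
$$

which is $c_k$ times a matrix that approaches $\nabla g^+(x^*)^T \nabla g^+(x^*)$. This
matrix has rank equal to the rank of the active constraints at $x^*$ [refer to (13.20)].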


Assuming that there are $r$ active constraints at the solution $x^*$, then for
well-behaved $\gamma$, the Hessian matrix $Q(c_k, x_k)$ has $r$ eigenvalues that tend to
infinity as $c_k \to \infty$, arising from the second term on the right side of (13.26).
There will be $n - r$ other eigenvalues that, although varying with $c_k$, tend to
finite limits. These limits turn out to be, as is perhaps not too surprising at this
point, the eigenvalues of $L(x^*)$ restricted to the tangent subspace $M$ of the active
constraints. The proof of this requires some further analysis.
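The eigenvalue behavior just described can be seen on a two-variable example. The Python sketch below assumes the problem of minimizing f(x) = x₁² + 2x₂² subject to g(x) = 1 - x₁ - x₂ ≤ 0 with the quadratic penalty. Since g is linear, G = 0 and, at any point where the constraint is violated, (13.26) reduces to Q = F + c ∇gᵀ∇g with F = diag(2, 4) and ∇g = (-1, -1). The tangent subspace M is spanned by (1, -1), and the restriction of F to M is the single number (2 + 4)/2 = 3.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenvalues (smaller, larger) of the symmetric matrix [[a, b], [b, d]]."""
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    return mean - radius, mean + radius

def penalty_hessian(c):
    """Q = F + c * grad_g^T grad_g for the assumed problem, at a violated point."""
    F = ((2.0, 0.0), (0.0, 4.0))
    grad_g = (-1.0, -1.0)
    return [[F[i][j] + c * grad_g[i] * grad_g[j] for j in range(2)] for i in range(2)]

for c in (1e0, 1e2, 1e4):
    Q = penalty_hessian(c)
    small, large = eig_sym_2x2(Q[0][0], Q[0][1], Q[1][1])
    # One eigenvalue grows like 2c; the other approaches 3, the value of L restricted to M.
    print(c, small, large, large / small)
```

With one active constraint (r = 1) one eigenvalue tends to infinity and the remaining n - r = 1 eigenvalue tends to 3, so the condition number of Q grows roughly in proportion to c.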


Lemma 1. Let $A(c)$ be a symmetric matrix written in partitioned form

$$
A(c) =
\begin{bmatrix}
A_1(c) & A_2(c) \\
A_2^T(c) & A_3(c)
\end{bmatrix},
\tag{13.27}
$$

where $A_1(c)$ tends to a positive definite matrix $A_1$, $A_2(c)$ tends to a finite
matrix, and $A_3(c)$ is a positive definite matrix tending to infinity with $c$ (that
is, for any $s > 0$, $A_3(c) - sI$ is positive definite for sufficiently large $c$). Then

$$
A^{-1}(c) \to
\begin{bmatrix}
A_1^{-1} & 0 \\
0 & 0
\end{bmatrix}
\tag{13.28}
$$

as $c \to \infty$.


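Proof. We have the identity

$$\begin{bmatrix} A_1 & A_2 \\ A_2^T & A_3 \end{bmatrix}^{-1} = \begin{bmatrix} (A_1 - A_2 A_3^{-1} A_2^T)^{-1} & -(A_1 - A_2 A_3^{-1} A_2^T)^{-1} A_2 A_3^{-1} \\ -A_3^{-1} A_2^T (A_1 - A_2 A_3^{-1} A_2^T)^{-1} & (A_3 - A_2^T A_1^{-1} A_2)^{-1} \end{bmatrix}. \tag{13.29}$$

Using the fact that $A_3^{-1}(c) \to 0$ gives the result. ∎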
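A quick numerical check of Lemma 1 may help in the simplest setting, where every block is a scalar. This is only a sketch: the values `a1 = 2`, `a2 = 1` and the helper `inverse_2x2` are illustrative assumptions, not taken from the text.

```python
def inverse_2x2(a1, a2, a3):
    """Inverse of the symmetric matrix [[a1, a2], [a2, a3]] (adjugate formula)."""
    det = a1 * a3 - a2 * a2
    return [[a3 / det, -a2 / det], [-a2 / det, a1 / det]]

a1, a2 = 2.0, 1.0  # stand-ins for A1(c) and A2(c), held fixed for simplicity

for c in (1e1, 1e3, 1e6):
    inv = inverse_2x2(a1, a2, c)  # A3(c) = c tends to infinity
    print(f"c={c:g}  inverse={inv}")

# For very large c the inverse should be close to [[1/a1, 0], [0, 0]].
limit = inverse_2x2(a1, a2, 1e9)
```

Only the upper-left entry survives in the limit, and it equals `1 / a1`, as (13.28) predicts.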




To apply this result to the Hessian matrix (13.26) we associate $A$ with $Q(c_k, x_k)$
and let the partition of $A$ correspond to the partition of the space $E^n$ into the subspace
$M$ and the subspace $N$ that is orthogonal to $M$; that is, $N$ is the subspace spanned by
the gradients of the active constraints. In this partition, $L_M$, the restriction of $L$ to
$M$, corresponds to the matrix $A_1$.

We leave the details of the required continuity arguments to the reader. The
important conclusion is that if $x^*$ is a solution to (13.16), is a regular point, and
has exactly $r$ active constraints none of which are degenerate, then the Hessian matrices
$Q(c_k, x_k)$ of a penalty function of form (13.18) have $r$ eigenvalues tending to
infinity as $c_k \to \infty$, and $n - r$ eigenvalues tending to the eigenvalues of $L_M$.

This  explicit  characterization  of  the  structure  of  penalty  function  Hessians  is  of 
great  importance  in  the  remainder  of  the  chapter.  The  fundamental  point  is  that 
virtually  any  choice  of  penalty  function  (within  the  class  considered)  leads  both  to 
an  ill-conditioned  Hessian  and  to  consideration  of  the  ubiquitous  Hessian  of  the 
Lagrangian  restricted  to  M. 
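The eigenvalue structure described above can be observed numerically on a two-dimensional model. The sketch below assumes a hypothetical matrix `L` standing in for the Hessian of the Lagrangian and a single active constraint with unit gradient `a` (so $r = 1$, $n = 2$); neither comes from the text. It forms $Q(c) = L + c\,a a^T$ and compares its small eigenvalue with $L$ restricted to $M$.

```python
import math

def eig_sym_2x2(p, q, r):
    """Eigenvalues (small, large) of the symmetric matrix [[p, q], [q, r]]."""
    tr, det = p + r, p * r - q * q
    large = 0.5 * (tr + math.sqrt(tr * tr - 4.0 * det))
    return det / large, large  # small = det / large avoids cancellation

# Hypothetical data (assumptions for illustration only).
L = [[2.0, 1.0], [1.0, 3.0]]
a = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]
m = [a[1], -a[0]]  # unit vector spanning the tangent subspace M

# In two dimensions, L restricted to M is the scalar m^T L m.
L_M = sum(m[i] * L[i][j] * m[j] for i in range(2) for j in range(2))

for c in (1e1, 1e3, 1e5):
    Q = [[L[i][j] + c * a[i] * a[j] for j in range(2)] for i in range(2)]
    small, large = eig_sym_2x2(Q[0][0], Q[0][1], Q[1][1])
    print(f"c={c:g}  small={small:.6f}  large={large:.3f}  L_M={L_M:.6f}")
```

As `c` grows, one eigenvalue grows in proportion to `c` while the other settles at `L_M`, which is the ill-conditioning the paragraph refers to.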


Barrier  Functions 


Essentially the same story holds for barrier functions. If we consider for Problem
(13.16) barrier functions of the form

$$B(x) = \eta(g(x)), \tag{13.30}$$

then Lagrange multipliers and ill-conditioned Hessians are again inevitable. Rather
than parallel the earlier analysis of penalty functions, we illustrate the conclusions
with two examples.


Example 1. Define

$$B(x) = -\sum_{i=1}^{p} \frac{1}{g_i(x)}. \tag{13.31}$$

The barrier objective

$$r(c_k, x) = f(x) - \frac{1}{c_k} \sum_{i=1}^{p} \frac{1}{g_i(x)}$$

has its minimum at a point $x_k$ satisfying

$$\nabla f(x_k) + \frac{1}{c_k} \sum_{i=1}^{p} \frac{1}{g_i(x_k)^2} \nabla g_i(x_k) = 0. \tag{13.32}$$

Thus, we define $\lambda_k$ to be the vector having $i$th component $\dfrac{1}{c_k} \cdot \dfrac{1}{g_i(x_k)^2}$. Then (13.32)
can be written as

$$\nabla f(x_k) + \lambda_k^T \nabla g(x_k) = 0.$$

Again, assuming $x_k \to x^*$, the solution of (13.16), we can show that $\lambda_k \to \lambda^*$, the
Lagrange multiplier vector associated with the solution. This implies that if $g_i$ is an
active constraint,

$$\frac{1}{c_k g_i(x_k)^2} \to \lambda_i^* < \infty. \tag{13.33}$$

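Next, evaluating the Hessian $R(c_k, x_k)$ of $r(c_k, x_k)$, we have

$$\begin{aligned} R(c_k, x_k) &= F(x_k) + \frac{1}{c_k} \sum_{i=1}^{p} \frac{1}{g_i(x_k)^2} G_i(x_k) - \frac{1}{c_k} \sum_{i=1}^{p} \frac{2}{g_i(x_k)^3} \nabla g_i(x_k)^T \nabla g_i(x_k) \\ &= L(x_k) - \frac{1}{c_k} \sum_{i=1}^{p} \frac{2}{g_i(x_k)^3} \nabla g_i(x_k)^T \nabla g_i(x_k), \end{aligned}$$

where $F$ and $G_i$ denote the Hessians of $f$ and $g_i$. As $c_k \to \infty$ we have

$$\frac{-1}{c_k g_i(x_k)^3} \to \begin{cases} \infty & \text{if } g_i \text{ is active at } x^* \\ 0 & \text{if } g_i \text{ is inactive at } x^* \end{cases}$$

so that we may write, from (13.33),

$$R(c_k, x_k) \to L(x_k) + \sum_{i \in I} \frac{-2\lambda_{ik}}{g_i(x_k)} \nabla g_i(x_k)^T \nabla g_i(x_k),$$

where $I$ is the set of indices corresponding to active constraints. Thus the Hessian
of the barrier objective function has exactly the same structure as that of penalty
objective functions.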
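The behavior in Example 1 can be checked on a one-variable model. The sketch below assumes the hypothetical problem of minimizing $f(x) = (x+1)^2$ subject to $g(x) = -x \le 0$ (solution $x^* = 0$, multiplier $\lambda^* = 2$); this problem and the helper `barrier_minimizer` are illustrative choices, not from the text. The inverse-barrier objective is then $(x+1)^2 + 1/(c x)$ on $x > 0$.

```python
def barrier_minimizer(c, lo=1e-12, hi=1.0, iters=200):
    """Bisection on r'(x) = 2(x + 1) - 1/(c x^2), which is increasing for x > 0."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if 2.0 * (mid + 1.0) - 1.0 / (c * mid * mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

results = []
for c in (1e2, 1e4, 1e6, 1e8):
    x = barrier_minimizer(c)
    g = -x                        # constraint value, negative in the interior
    lam = 1.0 / (c * g * g)       # multiplier estimate 1 / (c g^2)
    hess = 2.0 - 2.0 * lam / g    # second derivative of the barrier objective
    results.append((c, x, lam, hess))
    print(f"c={c:g}  x={x:.3e}  lambda={lam:.6f}  hessian={hess:.1f}")
```

The multiplier estimate approaches 2 while the second derivative grows without bound, mirroring the limit of $R(c_k, x_k)$ above.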


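Example 2. Let us use the logarithmic barrier function

$$B(x) = -\sum_{i=1}^{p} \log[-g_i(x)].$$

In this case we will define the barrier objective in terms of $\mu$ as

$$r(\mu, x) = f(x) - \mu \sum_{i=1}^{p} \log[-g_i(x)].$$

The minimum point $x_\mu$ satisfies

$$0 = \nabla f(x_\mu) + \mu \sum_{i=1}^{p} \frac{-1}{g_i(x_\mu)} \nabla g_i(x_\mu). \tag{13.34}$$

Defining

$$\lambda_{\mu,i} = \mu \, \frac{-1}{g_i(x_\mu)},$$

(13.34) can be written as

$$\nabla f(x_\mu) + \lambda_\mu^T \nabla g(x_\mu) = 0.$$

Further we expect that $\lambda_\mu \to \lambda^*$ as $\mu \to 0$.
The Hessian of $r(\mu, x)$ is

$$R(\mu, x_\mu) = F(x_\mu) + \sum_{i=1}^{p} \lambda_{\mu,i} G_i(x_\mu) + \sum_{i=1}^{p} \frac{-\lambda_{\mu,i}}{g_i(x_\mu)} \nabla g_i(x_\mu)^T \nabla g_i(x_\mu),$$

where $F$ and $G_i$ denote the Hessians of $f$ and $g_i$. Hence, for small $\mu$ it has the same
structure as that found in Example 1.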
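As a small worked instance of Example 2, consider the one-variable problem of minimizing $f(x) = (x+1)^2$ subject to $g(x) = -x \le 0$, whose solution is $x^* = 0$ with multiplier $\lambda^* = 2$. This problem is an illustrative assumption, not one from the text; it is chosen because every quantity has a closed form.

```latex
\begin{align*}
r(\mu, x) &= (x+1)^2 - \mu \log x, \\
0 = r'(\mu, x_\mu) &= 2(x_\mu + 1) - \frac{\mu}{x_\mu}
  \quad\Longrightarrow\quad x_\mu = \frac{-1 + \sqrt{1 + 2\mu}}{2}, \\
\lambda_\mu &= \mu \cdot \frac{-1}{g(x_\mu)} = \frac{\mu}{x_\mu} = 2(x_\mu + 1)
  \;\to\; 2 = \lambda^* \quad (\mu \to 0), \\
r''(\mu, x_\mu) &= 2 + \frac{\mu}{x_\mu^2} = 2 + \frac{\lambda_\mu}{x_\mu}
  \;\to\; \infty \quad (\mu \to 0).
\end{align*}
```

The multiplier estimate stays bounded while the second derivative blows up like $\lambda_\mu / x_\mu$, which is the scalar version of the Hessian structure described in the example.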


The  Central  Path 

The definition of the central path associated with linear programs is easily extended
to general nonlinear programs. For example, consider the problem

$$\begin{array}{ll} \text{minimize} & f(x) \\ \text{subject to} & h(x) = 0, \quad g(x) \le 0. \end{array}$$

We assume that $\mathring{\mathcal{F}} = \{x : h(x) = 0,\ g(x) < 0\} \ne \emptyset$. Then we use the logarithmic
barrier function to define the problems

$$\begin{array}{ll} \text{minimize} & f(x) - \mu \sum_{i=1}^{p} \log[-g_i(x)] \\ \text{subject to} & h(x) = 0. \end{array}$$

The solution $x_\mu$ parameterized by $\mu \to 0$ is called the central path; see Chap. 5.

The necessary conditions for the problem can be written as

$$\begin{aligned} \nabla f(x_\mu) + s^T \nabla g(x_\mu) + y^T \nabla h(x_\mu) &= 0 \\ h(x_\mu) &= 0 \\ s_i g_i(x_\mu) &= -\mu, \quad i = 1, 2, \ldots, p, \end{aligned}$$

where $y$ is the Lagrange multiplier vector for the constraint $h(x_\mu) = 0$. Then, the
Newton method can be applied to solving the condition system as $\mu$ is gradually
reduced to 0, that is, following the path.
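A minimal sketch of path following, under simplifying assumptions: one variable, no equality constraints, and the hypothetical problem of minimizing $(x+1)^2$ subject to $g(x) = -x \le 0$. The conditions above then reduce to $2(x+1) - s = 0$ and $s x = \mu$. The function name `follow_central_path`, the damping rule, and the schedule for $\mu$ are illustrative choices, not prescribed by the text.

```python
def follow_central_path(mu_values, x=1.0, s=1.0, tol=1e-12, max_iter=50):
    """Newton's method on
         r1 = 2(x + 1) - s = 0   (stationarity, with grad g = -1)
         r2 = s * x - mu   = 0   (s * g(x) = -mu, with g(x) = -x)
    warm-started from the previous point each time mu is reduced."""
    path = []
    for mu in mu_values:
        for _ in range(max_iter):
            r1 = 2.0 * (x + 1.0) - s
            r2 = s * x - mu
            if abs(r1) + abs(r2) < tol:
                break
            # Jacobian of (r1, r2) with respect to (x, s) is [[2, -1], [s, x]].
            det = 2.0 * x + s
            dx = (-r1 * x - r2) / det
            ds = (s * r1 - 2.0 * r2) / det
            t = 1.0
            while x + t * dx <= 0.0 or s + t * ds <= 0.0:
                t *= 0.5  # damp the step so that x > 0 and s > 0 are kept
            x, s = x + t * dx, s + t * ds
        path.append((mu, x, s))
    return path

mus = [10.0 ** (-k) for k in range(0, 9)]  # mu = 1, 0.1, ..., 1e-8
path = follow_central_path(mus)
for mu, x, s in path:
    print(f"mu={mu:.0e}  x={x:.3e}  s={s:.9f}")
```

Along the path `x` tends to the constrained solution 0 and `s` tends to the multiplier 2; warm starting keeps the number of Newton iterations per value of `mu` small.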


Geometric  Interpretation:  The  Primal  Function 

There is a geometric construction that provides a simple interpretation of penalty
functions. The basis of the construction itself is also useful in other areas of
optimization, especially duality theory, as explained in the next chapter.

Let us again consider the problem

$$\begin{array}{ll} \text{minimize} & f(x) \\ \text{subject to} & h(x) = 0, \end{array} \tag{13.35}$$

where $h(x) \in E^m$. We assume that the solution point $x^*$ of (13.35) is a regular point
and that the second-order sufficiency conditions are satisfied. Corresponding to this
problem we introduce the following definition:

Definition. Corresponding to the constrained minimization problem (13.35), the primal
function $\omega$ is defined on $E^m$ in a neighborhood of 0 to be

$$\omega(y) = \min\{f(x) : h(x) = y\}. \tag{13.36}$$

The primal function gives the optimal value of the objective for various values of
the right-hand side. In particular $\omega(0)$ gives the value of the original problem.

Strictly speaking the minimum in the definition (13.36) must be specified as a
local minimum, in a neighborhood of $x^*$. The existence of $\omega(y)$ then follows directly
from the Sensitivity Theorem in Sect. 11.7. Furthermore, from that theorem it
follows that $\nabla\omega(0) = -\lambda^{*T}$.

Now consider the penalty problem and note the following relations:

$$\begin{aligned} \min_x \{f(x) + \tfrac{1}{2}c|h(x)|^2\} &= \min_{x,y} \{f(x) + \tfrac{1}{2}c|y|^2 : h(x) = y\} \\ &= \min_y \{\omega(y) + \tfrac{1}{2}c|y|^2\}. \end{aligned} \tag{13.37}$$


Fig.  13.4  The  primal  function 

This is illustrated in Fig. 13.4 for the case where $y$ is one-dimensional. The primal
function is the lowest curve in the figure. Its value at $y = 0$ is the value of the
original constrained problem. Above the primal function are the curves $\omega(y) + \tfrac{1}{2}cy^2$
for various values of $c$. The value of the penalty problem is shown by (13.37) to be
the minimum point of this curve. For large values of $c$ this curve becomes convex
near 0 even if $\omega(y)$ is not convex. Viewed in this way, the penalty functions can be
thought of as convexifying the primal.

Also, as $c$ increases, the associated minimum point moves toward 0. However, it
is never zero for finite $c$. Furthermore, in general, the criterion for $y$ to be optimal
for the penalty problem is that the gradient of $\omega(y) + \tfrac{1}{2}cy^2$ equals zero. This yields
$\nabla\omega(y) + cy^T = 0$. Using $\nabla\omega(y) = -\lambda^T$ and $y = h(x_c)$, where now $x_c$ denotes
the minimum point of the penalty problem, gives $\lambda = ch(x_c)$, which is the same
as (13.23).
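A one-dimensional worked example may make the convexification concrete. Take $f(x) = x - x^2$ and $h(x) = x$, so that the feasible set is the single point $x^* = 0$ and the multiplier is $\lambda^* = -1$. This problem is an illustrative assumption, not one from the text.

```latex
\begin{align*}
\omega(y) &= \min\{x - x^2 : x = y\} = y - y^2,
  \qquad \nabla\omega(0) = 1 = -\lambda^*, \\
\omega(y) + \tfrac{1}{2} c y^2 &= y + \left(\tfrac{c}{2} - 1\right) y^2
  \qquad \text{(has a minimum only when } c > 2\text{)}, \\
0 &= \nabla\omega(y_c) + c\,y_c = 1 + (c - 2)\,y_c
  \quad\Longrightarrow\quad y_c = -\frac{1}{c - 2} \;\to\; 0, \\
\lambda_c &= c\,h(x_c) = c\,y_c = -\frac{c}{c - 2}
  \;\to\; -1 = \lambda^* \quad (c \to \infty).
\end{align*}
```

Here the primal function is concave, yet the penalized curve is convex for every $c > 2$; its minimizer is never exactly zero for finite $c$, and $c\,h(x_c)$ recovers the multiplier in the limit.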


13.4  Newton’s  Method  and  Penalty  Functions 


In the next few sections we address the problem of efficiently solving the unconstrained
problems associated with a penalty or barrier method. The main difficulty
is the extremely unfavorable eigenvalue structure that, as explained in Sect. 13.3,
always accompanies unconstrained problems derived in this way. Certainly straightforward
application of the method of steepest descent is out of the question!

One method for avoiding slow convergence for these problems is to apply Newton's
method (or one of its variations), since the order two convergence of Newton's
method is unaffected by the poor eigenvalue structure. In applying the method, however,
special care must be devoted to the manner by which the Hessian is inverted,
since it is ill-conditioned. Nevertheless, if second-order information is easily available,
Newton's method offers an extremely attractive and effective method for solving
modest size penalty or barrier optimization problems. When such information
is not readily available, or if data handling and storage requirements of Newton's
method are excessive, attention naturally focuses on first-order methods.

A simple modified Newton's method can often be quite effective for some penalty
problems. For example, consider the problem having only equality constraints

$$\begin{array}{ll} \text{minimize} & f(x) \\ \text{subject to} & h(x) = 0 \end{array} \tag{13.38}$$

with $x \in E^n$, $h(x) \in E^m$, $m < n$. Applying the standard quadratic penalty method
we solve instead the unconstrained problem

$$\text{minimize} \quad f(x) + \tfrac{1}{2}c|h(x)|^2 \tag{13.39}$$

for some large $c$. Calling the penalty objective function $q(x)$ we consider the iterative
process

$$x_{k+1} = x_k - \alpha_k [I + c\nabla h(x_k)^T \nabla h(x_k)]^{-1} \nabla q(x_k)^T, \tag{13.40}$$

where $\alpha_k$ is chosen to minimize $q(x_{k+1})$. The matrix $I + c\nabla h(x_k)^T \nabla h(x_k)$ is positive
definite and although quite ill-conditioned it can be inverted efficiently (see
Exercise 11).

According to the Modified Newton Method Theorem (Sect. 10.1) the rate of convergence
of this method is determined by the eigenvalues of the matrix

$$[I + c\nabla h(x_k)^T \nabla h(x_k)]^{-1} Q(x_k), \tag{13.41}$$

where $Q(x_k)$ is the Hessian of $q$ at $x_k$. In view of (13.26), as $c \to \infty$ the matrix
(13.41) will have $m$ eigenvalues that approach unity, while the remaining $n - m$
eigenvalues approach the eigenvalues of $L_M$ evaluated at the solution $x^*$ of (13.38).
Thus, if the smallest and largest eigenvalues of $L_M$, $a$ and $A$, are located such that
the interval $[a, A]$ contains unity, the convergence ratio of this modified Newton's
method will be equal (in the limit of $c \to \infty$) to the canonical ratio $[(A - a)/(A + a)]^2$
for problem (13.38).

If the eigenvalues of $L_M$ are not spread below and above unity, the convergence
rate will be slowed. If a point in the interval containing the eigenvalues of $L_M$ is
known, a scalar factor can be introduced so that the canonical rate is achieved, but
such information is often not easily available.
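The iteration (13.40) can be sketched on a small quadratic model. Everything specific below is an assumption made for illustration: the problem $f(x) = \tfrac{1}{2}\sum_i d_i x_i^2$ with one linear constraint $a^T x = b$, the data `d`, `a`, `b`, `c`, and the use of the Sherman-Morrison formula as one possible way to apply $[I + c\,a a^T]^{-1}$. The eigenvalues of $L_M$ lie between 0.5 and 2 here, so the interval contains unity.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical test problem: f(x) = 0.5 * sum(d_i x_i^2), h(x) = a.x - b.
d = [0.5, 1.0, 2.0]
a = [1.0, 1.0, 1.0]
b = 1.0
c = 1.0e3

def grad_q(x):
    viol = dot(a, x) - b
    return [d[i] * x[i] + c * a[i] * viol for i in range(3)]

def hess_q_times(p):
    ap = dot(a, p)
    return [d[i] * p[i] + c * a[i] * ap for i in range(3)]

def precondition(v):
    """Apply [I + c a a^T]^{-1} to v using the Sherman-Morrison formula."""
    scale = c * dot(a, v) / (1.0 + c * dot(a, a))
    return [v[i] - scale * a[i] for i in range(3)]

def minimize(use_modified_newton, tol=1e-8, max_steps=200000):
    x = [0.0, 0.0, 0.0]
    for step in range(max_steps):
        g = grad_q(x)
        if math.sqrt(dot(g, g)) < tol:
            return x, step
        p = precondition(g) if use_modified_newton else g
        alpha = dot(g, p) / dot(p, hess_q_times(p))  # exact line search
        x = [x[i] - alpha * p[i] for i in range(3)]
    return x, max_steps

x_mod, steps_mod = minimize(True)
x_sd, steps_sd = minimize(False)
print("modified Newton steps:", steps_mod)
print("steepest descent steps:", steps_sd)
```

On this model the modified iteration needs far fewer steps than plain steepest descent, because the single eigenvalue of order `c` is mapped close to one.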


Inequalities 

If there are inequality as well as equality constraints in the problem, the analogous
procedure can be applied to the associated penalty objective function. The unusual
feature of this case is that corresponding to an inequality constraint $g_i(x) \le 0$,
the term $\nabla g_i^+(x)^T \nabla g_i^+(x)$ used in the iteration matrix will suddenly appear if the
constraint is violated. Thus the iteration matrix is discontinuous with respect to $x$,
and as the method progresses its nature changes according to which constraints are
violated. This discontinuity does not, however, imply that the method is subject to
jamming, since the result of Exercise 4, Chap. 10 is applicable to this method.


13.5  Conjugate  Gradients  and  Penalty  Methods 

The partial conjugate gradient method proposed and analyzed in Sect. 9.5 is ideally
suited to penalty or barrier problems having only a few active constraints. If there are
$m$ active constraints, then taking cycles of $m + 1$ conjugate gradient steps will yield
a rate of convergence that is independent of the penalty constant $c$. For example,
consider the problem having only equality constraints:

$$\begin{array}{ll} \text{minimize} & f(x) \\ \text{subject to} & h(x) = 0, \end{array} \tag{13.42}$$



where $x \in E^n$, $h(x) \in E^m$, $m < n$. Applying the standard quadratic penalty method,
we solve instead the unconstrained problem

$$\text{minimize} \quad f(x) + \tfrac{1}{2}c|h(x)|^2 \tag{13.43}$$

for some large $c$. The objective function of this problem has a Hessian matrix that
has $m$ eigenvalues that are of the order $c$ in magnitude, while the remaining $n - m$
eigenvalues are close to the eigenvalues of the matrix $L_M$, corresponding to problem
(13.42). Thus, letting $x_{k+1}$ be determined from $x_k$ by taking $m + 1$ steps of a
(nonquadratic) conjugate gradient method, and assuming $x_k \to \bar{x}$, a solution to (13.43),
the sequence $\{f(x_k)\}$ converges linearly to $f(\bar{x})$ with a convergence ratio equal to
approximately

$$\left( \frac{A - a}{A + a} \right)^2, \tag{13.44}$$

where $a$ and $A$ are, respectively, the smallest and largest eigenvalues of $L_M(\bar{x})$.

This is an extremely effective technique when $m$ is relatively small. The programming
logic required is only slightly greater than that of steepest descent, and
the time per iteration is only about $m + 1$ times as great as for steepest descent.
The method can be used for problems having inequality constraints as well but it is
advisable to change the cycle length, depending on the number of constraints active
at the end of the previous cycle.
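The effect of the cycle length can be seen on a small quadratic penalty problem. The sketch below is not the example from the text: the data (`n = 6`, two linear constraints `A x = b`, `c = 100`) and the function `partial_cg` are hypothetical, chosen so the code stays short. Because the penalized objective is quadratic, exact line searches have a closed form.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

# Hypothetical data: f(x) = sum_k k * x_k^2 with n = 6 and m = 2 linear
# constraints A x = b, penalized as q(x) = f(x) + 0.5 * c * |A x - b|^2.
n, c = 6, 100.0
A = [[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]]
b = [1.0, 2.0]
m = len(A)

# q is quadratic: its Hessian is H = 2 diag(1..n) + c A^T A and its
# gradient is H x - rhs with rhs = c A^T b.
H = [[(2.0 * (i + 1) if i == j else 0.0)
      + c * sum(A[r][i] * A[r][j] for r in range(m))
      for j in range(n)] for i in range(n)]
rhs = [c * sum(A[r][i] * b[r] for r in range(m)) for i in range(n)]

def partial_cg(p, tol=1e-8, max_steps=50000):
    """Conjugate gradient steps restarted every p steps; returns (x, steps)."""
    x = [0.0] * n
    steps = 0
    while steps < max_steps:
        g = [hx - r for hx, r in zip(matvec(H, x), rhs)]
        d = [-gi for gi in g]  # each cycle starts with a steepest descent step
        for _ in range(p):
            if math.sqrt(dot(g, g)) < tol:
                return x, steps
            Hd = matvec(H, d)
            alpha = -dot(g, d) / dot(d, Hd)  # exact line search
            x = [xi + alpha * di for xi, di in zip(x, d)]
            g_new = [gi + alpha * hdi for gi, hdi in zip(g, Hd)]
            beta = dot(g_new, g_new) / dot(g, g)
            d = [-gn + beta * di for gn, di in zip(g_new, d)]
            g = g_new
            steps += 1
    return x, steps

x_sd, steps_sd = partial_cg(1)        # p = 1 is ordinary steepest descent
x_pcg, steps_pcg = partial_cg(m + 1)  # cycles of m + 1 steps
print("steepest descent steps:", steps_sd)
print(f"steps with cycles of {m + 1}:", steps_pcg)
```

With cycles of `m + 1` steps the two eigenvalues of order `c` are neutralized within each cycle, so the step count should be much smaller than for `p = 1`.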


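Example 3.

$$\begin{aligned} \text{minimize} \quad & f(x_1, x_2, \ldots, x_{10}) = \sum_{k=1}^{10} k x_k^2 \\ \text{subject to} \quad & 1.5x_1 + x_2 + x_3 + 0.5x_4 + 0.5x_5 = 5.5 \\ & 2.0x_6 - 0.5x_7 - 0.5x_8 + x_9 - x_{10} = 2.0 \\ & x_1 + x_3 + x_5 + x_7 + x_9 = 10 \\ & x_2 + x_4 + x_6 + x_8 + x_{10} = 15. \end{aligned}$$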

This problem was treated by the penalty function approach, and the resulting composite
function was then solved for various values of $c$ by using various cycle lengths
of a conjugate gradient algorithm. In Table 13.1 $p$ is the number of conjugate gradient
steps in a cycle. Thus, $p = 1$ corresponds to ordinary steepest descent; $p = 5$
corresponds, by the theory of Sect. 9.5, to the smallest value of $p$ for which the rate
of convergence is independent of $c$; and $p = 10$ is the standard conjugate gradient
method. Note that for $p < 5$ the convergence rate does indeed depend on $c$, while it
is more or less constant for $p \ge 5$. The values of $c$ selected are not artificially large,
since for $c = 200$ the constraints are satisfied only to within 0.5 % of their right-hand
sides. For problems with nonlinear constraints the results will most likely be
somewhat less favorable, since the predicted convergence rate would apply only to
the tail of the sequence.

Table 13.1 Results for Example 3

             p (steps      Number of cycles    No. of    Value of modified
             per cycle)    to convergence      steps     objective

c = 20           1               90              90         388.565
                 3                8              24         388.563
                 5                3              15         388.563
                 7                3              21         388.563

c = 200          1              230a            230         488.607
                 3               21              63         487.446
                 5                4              20         487.438
                 7                2              14         487.433

c = 2000         1              260a            260         525.238
                 3               45a            135         503.550
                 5                3              15         500.910
                 7                3              21         500.882

a Program not run to convergence due to excessive time


13.6  Normalization  of  Penalty  Functions 


There is a good deal of freedom in the selection of penalty or barrier functions that
can be exploited to accelerate convergence. We propose here a simple normalization
procedure that together with a two-step cycle of conjugate gradients yields the
canonical rate of convergence. Again for simplicity we illustrate the technique for
the penalty method applied to the problem

$$\begin{array}{ll} \text{minimize} & f(x) \\ \text{subject to} & h(x) = 0 \end{array} \tag{13.45}$$

as in Sects. 13.4 and 13.5, but the idea is easily extended to other penalty or barrier
situations.

Corresponding to (13.45) we consider the family of quadratic penalty functions

$$P(x) = \tfrac{1}{2} h(x)^T \Gamma h(x), \tag{13.46}$$

where $\Gamma$ is a symmetric positive definite $m \times m$ matrix. We ask what the best choice
of $\Gamma$ might be.

Letting

$$q(c, x) = f(x) + cP(x), \tag{13.47}$$

the Hessian of $q$ turns out to be, using (13.26),

$$Q(c, x_k) = L(x_k) + c\nabla h(x_k)^T \Gamma \nabla h(x_k). \tag{13.48}$$

The $m$ large eigenvalues are due to the second term on the right. The observation
we make is that although the $m$ large eigenvalues are all proportional to $c$, they
are not necessarily all equal. Indeed, for very large $c$ these eigenvalues are determined
almost exclusively by the second term, and are therefore $c$ times the nonzero
eigenvalues of the matrix $\nabla h(x_k)^T \Gamma \nabla h(x_k)$. We would like to select $\Gamma$ so that these
eigenvalues are not spread out but are nearly equal to one another. An ideal choice
for the $k$th iteration would be

$$\Gamma = [\nabla h(x_k) \nabla h(x_k)^T]^{-1}, \tag{13.49}$$

since then all nonzero eigenvalues would be exactly equal. However, we do not
allow $\Gamma$ to change at each step, and therefore compromise by setting

$$\Gamma = [\nabla h(x_0) \nabla h(x_0)^T]^{-1}, \tag{13.50}$$

where $x_0$ is the initial point of the iteration.
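The claim that the ideal $\Gamma$ makes all nonzero eigenvalues equal can be checked directly: with $\Gamma = [\nabla h \nabla h^T]^{-1}$ the matrix $\nabla h^T \Gamma \nabla h$ is an orthogonal projector, so its nonzero eigenvalues are all 1. The sketch below uses a hypothetical $2 \times 3$ constraint Jacobian `J` (an assumption for illustration) and compares with the unnormalized choice $\Gamma = I$.

```python
import math

# Hypothetical constraint Jacobian (m = 2 constraints, n = 3 variables).
J = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0]]

# S = J J^T (2 x 2) and Gamma = S^{-1}, as in the ideal choice of Gamma.
S = [[sum(J[i][k] * J[j][k] for k in range(3)) for j in range(2)] for i in range(2)]
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
Gamma = [[S[1][1] / det, -S[0][1] / det],
         [-S[1][0] / det, S[0][0] / det]]

# P = J^T Gamma J (3 x 3) is the normalized second-term matrix.
P = [[sum(J[r][i] * Gamma[r][s] * J[s][j] for r in range(2) for s in range(2))
      for j in range(3)] for i in range(3)]
P2 = [[sum(P[i][k] * P[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
trace_P = P[0][0] + P[1][1] + P[2][2]
idempotency_error = max(abs(P2[i][j] - P[i][j]) for i in range(3) for j in range(3))

# Without normalization (Gamma = I) the nonzero eigenvalues of J^T J are
# the eigenvalues of S, which are spread apart.
disc = math.sqrt((S[0][0] - S[1][1]) ** 2 + 4.0 * S[0][1] ** 2)
unnormalized = (0.5 * (S[0][0] + S[1][1] - disc), 0.5 * (S[0][0] + S[1][1] + disc))

print("trace of P:", trace_P, " max |P^2 - P|:", idempotency_error)
print("nonzero eigenvalues with Gamma = I:", unnormalized)
```

Since `P` is symmetric with `P @ P == P` and trace 2, its eigenvalues are 1, 1, 0; multiplying by `c` then gives two large eigenvalues that coincide.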

Using this penalty function, the corresponding eigenvalue structure will at any
point look approximately like that shown in Fig. 13.5. The eigenvalues are bunched
into two separate groups. As $c$ is increased the smaller eigenvalues move into the
interval $[a, A]$ where $a$ and $A$ are, as usual, the smallest and largest eigenvalues of
$L_M$ at the solution to (13.45). The larger eigenvalues move forward to the right and
spread further apart.

Fig. 13.5 Eigenvalue distributions


Using the result of Exercise 11, Chap. 9, we see that if $x_{k+1}$ is determined from
$x_k$ by two conjugate gradient steps, the rate of convergence will be linear at a ratio
determined by the widest of the two eigenvalue groups. If our normalization is sufficiently
accurate, the large-valued group will have the lesser width. In that case
convergence of this scheme is approximately that of the canonical rate for the original
problem. Thus, by proper normalization it is possible to obtain the canonical rate
of convergence for only about twice the time per iteration as required by steepest
descent.

There are, of course, numerous variations of this method that can be used in
practice. $\Gamma$ can, for example, be allowed to vary at each step, or it can be occasionally
updated.

Example. The example problem presented in the previous section was also solved
by the normalization method presented above. The results for various values of $c$
and for cycle lengths of one, two, and three are presented in Table 13.2. (All runs
were initiated from the zero vector.)

Table 13.2 Results for Example 3

             p (steps      Number of cycles    No. of    Value of modified
             per cycle)    to convergence      steps     objective

c = 10           1               28              28         251.2657
                 2                9              18         251.2657
                 3                5              15         251.2657

c = 100          1              153             153         379.5955
                 2               13              26         379.5955
                 3               11              33         379.5955

c = 1000         1              261a            261         402.0903
                 2               14              28         400.1687
                 3               13              39         400.1687

a Program not run to convergence due to excessive time


13.7  Penalty  Functions  and  Gradient  Projection 

The penalty function method can be combined with the idea of the gradient projection
method to yield an attractive general purpose procedure for solving constrained
optimization problems. The proposed combination method can be viewed either as a
way of accelerating the rate of convergence of the penalty function method by eliminating
the effect of the large eigenvalues, or as a technique for efficiently handling
the delicate and usually cumbersome requirement in the gradient projection method
that each point be feasible. The combined method converges at the canonical rate
(the same as does the gradient projection method), is globally convergent (unlike
the gradient projection method), and avoids much of the computational difficulty
associated with staying feasible.


Underlying  Concept 


The basic theoretical result that motivates the development of this algorithm is the
Combined Steepest Descent and Newton's Method Theorem of Sect. 10.7. The idea
is to apply this combined method to a penalty problem. For simplicity we first consider
the equality constrained problem

$$\begin{array}{ll} \text{minimize} & f(x) \\ \text{subject to} & h(x) = 0, \end{array} \tag{13.51}$$

where $x \in E^n$, $h(x) \in E^m$. The associated unconstrained penalty problem that we
consider is

$$\text{minimize} \quad q(x), \tag{13.52}$$

where

$$q(x) = f(x) + \tfrac{1}{2}c|h(x)|^2.$$

At any point $x_k$ let $M(x_k)$ be the subspace tangent to the surface $S_k = \{x : h(x) =
h(x_k)\}$. This is a slight extension of the tangent subspaces that we have considered
before, since $M(x_k)$ is defined even for points that are not feasible. If the sequence
$\{x_k\}$ converges to a solution $x_c$ of problem (13.52), then we expect that $M(x_k)$ will
in some sense converge to $M(x_c)$. The orthogonal complement of $M(x_k)$ is the space
generated by the gradients of the constraint functions evaluated at $x_k$. Let us denote
this space by $N(x_k)$. The idea of the algorithm is to take $N$ as the subspace over
which Newton's method is applied, and $M$ as the space over which the gradient
method is applied. A cycle of the algorithm would be as follows:

1. Given $x_k$, apply one step of Newton's method over the subspace $N(x_k)$ to obtain
a point $w_k$ of the form

$$w_k = x_k + \nabla h(x_k)^T u_k, \qquad u_k \in E^m.$$

2. From $w_k$, take an ordinary steepest descent step to obtain $x_{k+1}$.

Of course, we must show how Step 1 can be easily executed, and this is done below,
but first, without drawing out the details, let us examine the general structure of this
algorithm.

The process is illustrated in Fig. 13.6. The first step is analogous to the step
in the gradient projection method that returns to the feasible surface, except that
here the criterion is reduction of the objective function rather than satisfaction of
constraints. To interpret the second step, suppose for the moment that the original
problem (13.51) has a quadratic objective and linear constraints, so that,
consequently, the penalty problem (13.52) has a quadratic objective and $N(x)$, $M(x)$
and $\nabla h(x)$ are independent of $x$. In that case the first (Newton) step would exactly
minimize $q$ with respect to $N$, so that the gradient of $q$ at $w_k$ would be orthogonal
to $N$; that is, the gradient would lie in the subspace $M$. Furthermore, since
$\nabla q(w_k) = \nabla f(w_k) + ch(w_k)^T \nabla h(w_k)$, we see that $\nabla q(w_k)$ would in that case be equal
to the projection of the gradient of $f$ onto $M$. Hence, the second step is, in the
quadratic case exactly, and in the general case approximately, a move in the direction
of the projected negative gradient of the original objective function.

The convergence properties of such a scheme are easily predicted from the
theorem on the Combined Steepest Descent and Newton's Method, in Sect. 10.7,
and our analysis of the structure of the Hessian of the penalty objective function
given by (13.26). As $x_k \to x_c$ the rate will be determined by the ratio of largest to
smallest eigenvalues of the Hessian restricted to $M(x_c)$.

This leads, however, by what was shown in Sect. 12.3, to approximately the
canonical rate for problem (13.51). Thus this combined method will yield again
the canonical rate as $c \to \infty$.

Fig. 13.6 Illustration of the method


Implementing  the  First  Step 

To implement the first step of the algorithm suggested above it is necessary to show
how a Newton step can be taken in the subspace $N(x_k)$. We show that, again for
large values of $c$, this can be accomplished easily.

At the point $x_k$ the function $b$, defined by

$$b(u) = q(x_k + \nabla h(x_k)^T u) \tag{13.53}$$

for $u \in E^m$, measures the variations in $q$ with respect to displacements in $N(x_k)$.
We shall, for simplicity, assume that at each point, $x_k$, $\nabla h(x_k)$ has rank $m$. We can
immediately calculate the gradient with respect to $u$,

$$\nabla b(u) = \nabla q(x_k + \nabla h(x_k)^T u) \nabla h(x_k)^T, \tag{13.54}$$

and the $m \times m$ Hessian with respect to $u$ at $u = 0$,

$$B = \nabla h(x_k) Q(x_k) \nabla h(x_k)^T, \tag{13.55}$$

where $Q$ is the $n \times n$ Hessian of $q$ with respect to $x$. From (13.26) we have that at $x_k$

$$Q(x_k) = L_k(x_k) + c\nabla h(x_k)^T \nabla h(x_k). \tag{13.56}$$

And given $B$, the direction for the Newton step in $N$ would be

$$\begin{aligned} d_k &= -\nabla h(x_k)^T B^{-1} \nabla b(0)^T \\ &= -\nabla h(x_k)^T B^{-1} \nabla h(x_k) \nabla q(x_k)^T. \end{aligned} \tag{13.57}$$

It is clear from (13.55) and (13.56) that exact evaluation of the Newton step
requires knowledge of $L(x_k)$ which usually is costly to obtain. For large values of $c$,
however, $B$ can be approximated by

$$B \simeq c[\nabla h(x_k) \nabla h(x_k)^T]^2, \tag{13.58}$$

and hence a good approximation to the Newton direction is

$$d_k = -\frac{1}{c} \nabla h(x_k)^T [\nabla h(x_k) \nabla h(x_k)^T]^{-2} \nabla h(x_k) \nabla q(x_k)^T. \tag{13.59}$$

Thus a suitable implementation of one cycle of the algorithm is:

1. Calculate

$$d_k = -\frac{1}{c} \nabla h(x_k)^T [\nabla h(x_k) \nabla h(x_k)^T]^{-2} \nabla h(x_k) \nabla q(x_k)^T.$$

2. Find $\beta_k$ to minimize $q(x_k + \beta d_k)$ (using $\beta = 1$ as an initial search point), and set
$w_k = x_k + \beta_k d_k$.

3. Calculate $p_k = -\nabla q(w_k)^T$.

4. Find $\alpha_k$ to minimize $q(w_k + \alpha p_k)$, and set $x_{k+1} = w_k + \alpha_k p_k$.
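The four steps above can be sketched for a small quadratic model. The problem below is a hypothetical illustration, not one from the text: $f(x) = \tfrac{1}{2}\sum_i d_i x_i^2$ with a single linear constraint $a^T x = b$, so that $\nabla h \nabla h^T$ is the scalar $a^T a$ and both line searches have closed forms (an exact search replaces the search started at $\beta = 1$).

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical problem: f(x) = 0.5 * sum(d_i x_i^2), h(x) = a.x - b (m = 1).
d = [1.0, 2.0, 3.0]
a = [1.0, 1.0, 1.0]
b = 1.0
c = 1.0e4

def grad_q(x):
    viol = dot(a, x) - b
    return [d[i] * x[i] + c * a[i] * viol for i in range(3)]

def curvature(p):
    """p^T Q p for the constant Hessian Q = diag(d) + c a a^T."""
    return sum(d[i] * p[i] ** 2 for i in range(3)) + c * dot(a, p) ** 2

def cycle(x):
    g = grad_q(x)
    # Step 1: approximate Newton direction in N; with m = 1 the matrix
    # grad h grad h^T is the scalar a.a, so its inverse square is 1/(a.a)^2.
    coeff = -dot(a, g) / (c * dot(a, a) ** 2)
    dk = [coeff * ai for ai in a]
    # Step 2: exact line search along dk (closed form because q is quadratic).
    curv = curvature(dk)
    beta = -dot(g, dk) / curv if curv > 0.0 else 0.0
    w = [x[i] + beta * dk[i] for i in range(3)]
    # Steps 3 and 4: steepest descent from w with exact line search.
    p = [-gi for gi in grad_q(w)]
    curv = curvature(p)
    alpha = dot(p, p) / curv if curv > 0.0 else 0.0
    return [w[i] + alpha * p[i] for i in range(3)]

x = [0.0, 0.0, 0.0]
cycles = 0
while math.sqrt(dot(grad_q(x), grad_q(x))) > 1e-8 and cycles < 1000:
    x = cycle(x)
    cycles += 1
print("cycles:", cycles, " x:", x)
```

Although `c` is large, the number of cycles stays modest here, since the Newton-like step removes the stiff direction before each gradient step.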

It is interesting to compare the Newton step of this version of the algorithm with
the step for returning to the feasible region used in the ordinary gradient projection
method. We have

$$\nabla q(x_k)^T = \nabla f(x_k)^T + c\nabla h(x_k)^T h(x_k). \tag{13.60}$$

If we neglect $\nabla f(x_k)^T$ on the right (as would be valid if we are a long distance from
the constraint boundary) then the vector $d_k$ reduces to

$$d_k = -\nabla h(x_k)^T [\nabla h(x_k) \nabla h(x_k)^T]^{-1} h(x_k),$$

which is precisely the first estimate used to return to the boundary in the gradient
projection method. The scheme developed in this section can therefore be regarded
as one which corrects this estimate by accounting for the variation in $f$.

An important advantage of the present method is that it is not necessary to carry
out the search in detail. If $\beta = 1$ yields an improved value for the penalty objective,
no further search is required. If not, one need search only until some improvement
is obtained. At worst, if this search is poorly performed, the method degenerates
to steepest descent. When one finally gets close to the solution, however, $\beta = 1$ is
bound to yield an improvement and terminal convergence will progress at nearly the
canonical rate.


Inequality  Constraints 

The procedure is conceptually the same for problems with inequality constraints.
The only difference is that at the beginning of each cycle the subspace $M(x_k)$ is
calculated on the basis of those constraints that are either active or violated at $x_k$,
the others being ignored. The resulting technique is a descent algorithm in that
the penalty objective function decreases at each cycle; it is globally convergent
because of the pure gradient step taken at the end of each cycle; its rate of convergence
approaches the canonical rate for the original constrained problem as $c \to \infty$;
and there are no feasibility tolerances or subroutine iterations required.


13.8 Exact Penalty Functions


It is possible to construct penalty functions that are exact in the sense that the
solution of the penalty problem yields the exact solution to the original problem
for a finite value of the penalty parameter. With these functions it is not necessary
to solve an infinite sequence of penalty problems to obtain the correct solution.
However, a new difficulty introduced by these penalty functions is that they are
nondifferentiable.

For the general constrained problem

$$\begin{array}{ll} \text{minimize} & f(x) \\ \text{subject to} & h(x) = 0, \quad g(x) \le 0, \end{array} \tag{13.61}$$

consider the absolute-value penalty function

$$P(x) = \sum_{i=1}^{m} |h_i(x)| + \sum_{j=1}^{p} \max(0, g_j(x)). \tag{13.62}$$

The penalty problem is then, as usual,

$$\text{minimize} \quad f(x) + cP(x) \tag{13.63}$$

for some positive constant $c$. We investigate the properties of the absolute-value
penalty function through an example and then generalize the results.

Example 1. Consider the simple quadratic problem

$$\begin{array}{ll} \text{minimize} & 2x^2 + 2xy + y^2 - 2y \\ \text{subject to} & x = 0. \end{array} \tag{13.64}$$

It is easy to solve this problem directly by substituting $x = 0$ into the objective. This
leads immediately to $x = 0$, $y = 1$.

If a standard quadratic penalty function is used, we minimize the objective

$$2x^2 + 2xy + y^2 - 2y + \tfrac{1}{2}cx^2 \tag{13.65}$$

for $c > 0$. The solution again can be easily found and is $x = -2/(2 + c)$,
$y = 1 + 2/(2 + c)$ (setting the partial derivative with respect to $y$ to zero gives
$y = 1 - x$). This solution approaches the true solution as $c \to \infty$, as predicted by
the general theory. However, for any finite $c$ the solution is inexact.

Now let us use the absolute-value penalty function. We minimize the function

$$2x^2 + 2xy + y^2 - 2y + c|x|. \tag{13.66}$$

We rewrite (13.66) as

$$\begin{aligned} 2x^2 + 2xy + y^2 - 2y + c|x| &= 2x^2 + 2xy + c|x| + (y - 1)^2 - 1 \\ &= 2x^2 + 2x + c|x| + (y - 1)^2 + 2x(y - 1) - 1 \\ &= x^2 + (2x + c|x|) + (y - 1 + x)^2 - 1. \end{aligned} \tag{13.67}$$

All terms (except the $-1$) are nonnegative if $c > 2$. Therefore, the minimum value
of this expression is $-1$, which is achieved (uniquely) by $x = 0$, $y = 1$. Therefore,
for $c > 2$ the minimum point of the penalty problem is the correct solution to the
original problem (13.64).
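The contrast between the two penalties in this example can be checked numerically. The sketch below relies on one step not spelled out in the text, namely that setting the partial derivative with respect to $y$ to zero gives $y = 1 - x$ for either penalty, which leaves a convex one-variable problem in $x$. The helper names `argmin_convex`, `reduced`, and `solve` are illustrative.

```python
def argmin_convex(phi, lo=-2.0, hi=2.0, iters=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if phi(m1) < phi(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def reduced(x, penalty):
    # Objective of (13.64) with y eliminated via y = 1 - x, plus the penalty.
    y = 1.0 - x
    return 2 * x * x + 2 * x * y + y * y - 2 * y + penalty(x)

def solve(penalty):
    x = argmin_convex(lambda t: reduced(t, penalty))
    return x, 1.0 - x

results = {}
for c in (1.0, 3.0, 10.0):
    x_abs, y_abs = solve(lambda t: c * abs(t))        # absolute-value penalty
    x_quad, y_quad = solve(lambda t: 0.5 * c * t * t)  # quadratic penalty
    results[c] = (x_abs, y_abs, x_quad, y_quad)
    print(f"c={c:g}  abs: ({x_abs:.6f}, {y_abs:.6f})  quad: ({x_quad:.6f}, {y_quad:.6f})")
```

For `c = 3` and `c = 10` the absolute-value penalty returns the constrained solution (0, 1), while for `c = 1` (below the threshold 2) it does not; the quadratic penalty is inexact for every finite `c`.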

We let the reader verify that $\lambda = -2$ for this example. The fact that $c > |\lambda|$ is required for the solution to be exact is an illustration of a general result given by the following theorem.

Exact Penalty Theorem. Suppose that the point x* satisfies the second-order sufficiency conditions for a local minimum of the constrained problem (13.61). Let $\lambda$ and $\mu$ be the corresponding Lagrange multipliers. Then for $c > \max\{|\lambda_i|, \mu_j : i = 1, 2, \ldots, m,\ j = 1, 2, \ldots, p\}$, x* is also a local minimum of the absolute-value penalty objective (13.62).

Proof. For simplicity we assume that there are equality constraints only. Define the primal function

$$\omega(z) = \min_{x} \{f(x) : h_i(x) = z_i \text{ for } i = 1, 2, \ldots, m\}. \tag{13.68}$$

The primal function was introduced in Sect. 12.3. Under our assumption the function exists in a neighborhood of x* and is continuously differentiable, with $\nabla\omega(0) = -\lambda^T$. Now define

$$\omega_c(z) = \omega(z) + c \sum_{i=1}^{m} |z_i|.$$

Then we have

$$\begin{aligned} \min_{x} \Big\{ f(x) + c \sum_{i=1}^{m} |h_i(x)| \Big\} &= \min_{x,z} \Big\{ f(x) + c \sum_{i=1}^{m} |z_i| : h(x) = z \Big\} \\ &= \min_{z} \Big\{ \omega(z) + c \sum_{i=1}^{m} |z_i| \Big\} \\ &= \min_{z} \omega_c(z). \end{aligned}$$

By the Mean Value Theorem,

$$\omega(z) = \omega(0) + \nabla\omega(\alpha z) z$$

for some $\alpha$, $0 \le \alpha \le 1$. Therefore,

$$\omega_c(z) = \omega(0) + \nabla\omega(\alpha z) z + c \sum_{i=1}^{m} |z_i|. \tag{13.69}$$

We know that $\nabla\omega(z)$ is continuous at 0, and thus given $\varepsilon > 0$ there is a neighborhood of 0 such that $|\nabla\omega(z)_i| < |\lambda_i| + \varepsilon$. Thus

$$\begin{aligned} \nabla\omega(\alpha z) z = \sum_{i=1}^{m} \nabla\omega(\alpha z)_i z_i &\ge -\Big\{ \max_i |\nabla\omega(\alpha z)_i| \Big\} \sum_{i=1}^{m} |z_i| \\ &\ge -\Big\{ \max_i (|\lambda_i| + \varepsilon) \Big\} \sum_{i=1}^{m} |z_i|. \end{aligned}$$

Using this in (13.69), we obtain

$$\omega_c(z) \ge \omega(0) + \Big( c - \varepsilon - \max_i |\lambda_i| \Big) \sum_{i=1}^{m} |z_i|.$$

For $c > \varepsilon + \max_i |\lambda_i|$ it follows that $\omega_c(z)$ is minimized at z = 0. Since $\varepsilon$ was arbitrary, the result holds for $c > \max_i |\lambda_i|$.

This result is easily extended to include inequality constraints. (See Exercise 16.) ∎


It is possible to develop a geometric interpretation of the absolute-value penalty function analogous to the interpretation for ordinary penalty functions given in Fig. 13.4. Figure 13.7 corresponds to a problem for a single constraint. The smooth curve represents the primal function of the problem. Its value at 0 is the value of the original problem, and its slope at 0 is $-\lambda$. The function $\omega_c(z)$ is obtained by adding $c|z|$ to the primal function, and this function has a discontinuous derivative at z = 0. It is clear that for $c > |\lambda|$, this composite function has a minimum at exactly z = 0, corresponding to the correct solution.
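For the problem of Example 1 the primal function can be written out explicitly, which makes this picture concrete. The short derivation below is a worked check added for illustration, not part of the original argument; it fixes x = z and minimizes over y, which gives y = 1 - z.

```latex
\begin{align*}
\omega(z)      &= \min_{y}\,\{2z^2 + 2zy + y^2 - 2y\} = z^2 + 2z - 1,
                  \qquad \omega'(0) = 2 = -\lambda, \\
\omega_c(z)    &= z^2 + 2z - 1 + c|z|, \\
\omega_c'(0^+) &= 2 + c, \qquad \omega_c'(0^-) = 2 - c .
\end{align*}
```

Since $\omega_c$ is convex, z = 0 is its minimizer exactly when $\omega_c'(0^-) \le 0 \le \omega_c'(0^+)$, that is, when $c \ge 2 = |\lambda|$. This agrees with the threshold found in the example and with the condition $c > |\lambda|$ of the theorem.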

There are other exact penalty functions but, like the absolute-value penalty function, most are nondifferentiable at the solution. Such penalty functions are for this reason difficult to use directly; special descent algorithms for nondifferentiable objective functions have been developed, but they can be cumbersome. Furthermore, although these penalty functions are exact for a large enough c, it is not known at the outset what magnitude is sufficient. In practice a progression of c's must often be used. Because of these difficulties, the major use of exact penalty functions in nonlinear programming is as merit functions, measuring the progress of descent but not entering into the determination of the direction of movement. This idea is discussed in Chap. 15.


Fig. 13.7 Geometric interpretation of absolute-value penalty function


13.9 Summary

Penalty methods approximate a constrained problem by an unconstrained problem that assigns high cost to points that are far from the feasible region. As the approximation is made more exact (by letting the parameter c tend to infinity) the solution of the unconstrained penalty problem approaches the solution to the original constrained problem from outside the active constraints. Barrier methods, on the other hand, approximate a constrained problem by an (essentially) unconstrained problem that assigns high cost to being near the boundary of the feasible region, but unlike penalty methods, these methods are applicable only to problems having a robust feasible region. As the approximation is made more exact, the solution of the unconstrained barrier problem approaches the solution to the original constrained problem from inside the feasible region.

The objective functions of all penalty and barrier methods of the form $P(x) = \gamma(h(x))$, $B(x) = \eta(g(x))$ are ill-conditioned. If they are differentiable, then as $c \to \infty$ the Hessian (at the solution) is equal to the sum of L, the Hessian of the Lagrangian associated with the original constrained problem, and a matrix of rank r that tends to infinity (where r is the number of active constraints). This is a fundamental property of these methods.
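The growth of the large eigenvalue is easy to observe on a small case. The sketch below is an illustration under stated assumptions: it takes the quadratic penalty objective (13.65) of Sect. 13.8, whose Hessian is the constant matrix [[4 + c, 2], [2, 2]], and computes its two eigenvalues in closed form. The function names are hypothetical.

```python
import math


def penalty_hessian_eigenvalues(c):
    """Eigenvalues of [[4 + c, 2], [2, 2]], the Hessian of
    2x^2 + 2xy + y^2 - 2y + (c/2) x^2."""
    a, b, d = 4.0 + c, 2.0, 2.0
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    return mean - radius, mean + radius


def condition_number(c):
    """Ratio of the largest to the smallest eigenvalue of the penalty Hessian."""
    small, large = penalty_hessian_eigenvalues(c)
    return large / small


if __name__ == "__main__":
    for c in (1.0, 10.0, 100.0, 1000.0):
        print(c, penalty_hessian_eigenvalues(c), condition_number(c))
```

With one active constraint, one eigenvalue grows roughly like c while the other stays bounded (it tends to 2, the curvature of the Lagrangian along the constraint x = 0), so the condition number grows roughly in proportion to c.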

Effective exploitation of differentiable penalty and barrier functions requires that schemes be devised that eliminate the effect of the associated large eigenvalues. For this purpose the three general principles developed in earlier chapters, The Partial Conjugate Gradient Method, The Modified Newton Method, and The Combination of Steepest Descent and Newton's Method, when creatively applied, all yield methods that converge at approximately the canonical rate associated with the original constrained problem.

It is necessary to add a point of qualification with respect to some of the algorithms introduced in this chapter, lest it be inferred that they are offered as panaceas for the general programming problem. As has been repeatedly emphasized, the ideal study of convergence is a careful blend of analysis, good sense, and experimentation. The rate of convergence does not always tell the whole story, although it is often a major component of it. Although some of the algorithms presented in this chapter asymptotically achieve the canonical rate of convergence (at least approximately), for large c the points may have to be quite close to the solution before this rate characterizes the process. In other words, for large c the process may converge slowly in its initial phase, and, to obtain a truly representative analysis, one must look beyond the first-order convergence properties of these methods. For this reason many people find Newton's method attractive, although the work at each step can be substantial.


13.10  Exercises 

1. Show that if q(c, x) is continuous (with respect to x) and $q(c, x) \to \infty$ as $|x| \to \infty$, then q(c, x) has a minimum.

2. Suppose problem (13.1), with f continuous, is approximated by the penalty problem (13.2), and let $\{c_k\}$ be an increasing sequence of positive constants tending to infinity. Define $q(c, x) = f(x) + cP(x)$, and fix $\varepsilon > 0$. For each k let $x_k$ be determined satisfying

$$q(c_k, x_k) \le \Big[ \min_{x} q(c_k, x) \Big] + \varepsilon.$$

Show that if x* is a solution to (13.1), any limit point, $\bar{x}$, of the sequence $\{x_k\}$ is feasible and satisfies $f(\bar{x}) \le f(x^*) + \varepsilon$.

3. Construct an example problem and a penalty function such that, as $c \to \infty$, the solution to the penalty problem diverges to infinity.

4. Combined penalty and barrier method. Consider a problem of the form

$$\begin{aligned} &\text{minimize} && f(x) \\ &\text{subject to} && x \in S \cap T \end{aligned}$$

and suppose P is a penalty function for S and B is a barrier function for T. Define

$$d(c, x) = f(x) + cP(x) + \frac{1}{c} B(x).$$

Let $\{c_k\}$ be a sequence $c_k \to \infty$, and for k = 1, 2, ... let $x_k$ be a solution to

$$\text{minimize}\quad d(c_k, x)$$

subject to $x \in$ interior of T. Assume all functions are continuous, T is compact (and robust), the original problem has a solution x*, and that $S \cap [\text{interior of } T]$ is not empty. Show that

(a) $\lim_{k \to \infty} d(c_k, x_k) = f(x^*)$.

(b) $\lim_{k \to \infty} c_k P(x_k) = 0$.

(c) $\lim_{k \to \infty} \frac{1}{c_k} B(x_k) = 0$.

5. Prove the Theorem at the end of Sect. 13.2.

6. Find the central path for the problem of minimizing $x^2$ subject to $x \ge 0$.

7. Consider a penalty function for the equality constraints

$$h(x) = 0, \qquad h(x) \in E^m,$$

having the form

$$P(x) = \gamma(h(x)) = \sum_{i=1}^{m} w(h_i(x)),$$

where w is a function whose derivative $w'$ is analytic and has a zero of order $s \ge 1$ at zero.

(a) Show that corresponding to (13.26) we have

$$Q(c_k, x_k) = L_k(x_k) + c_k \sum_{i=1}^{m} w''(h_i(x_k)) \nabla h_i(x_k)^T \nabla h_i(x_k).$$

(b) Show that as $c_k \to \infty$, m eigenvalues of $Q(c_k, x_k)$ have magnitude on the order of $(c_k)^{1/s}$.

8. Corresponding to the problem

$$\begin{aligned} &\text{minimize} && f(x) \\ &\text{subject to} && g(x) \le 0, \end{aligned}$$

consider the sequence of unconstrained problems

$$\text{minimize}\quad f(x) + [g^+(x) + 1]^k - 1,$$

and suppose $x_k$ is the solution to the kth problem.

(a) Find an appropriate definition of a Lagrange multiplier $\lambda_k$ to associate with $x_k$.

(b) Find the limiting form of the Hessian of the associated objective function, and determine how fast the largest eigenvalues tend to infinity.

9. Repeat Exercise 8 for the sequence of unconstrained problems

$$\text{minimize}\quad f(x) + [(g(x) + 1)^+]^k.$$


10. Morrison's method. Suppose the problem

$$\begin{aligned} &\text{minimize} && f(x) \\ &\text{subject to} && h(x) = 0 \end{aligned} \tag{13.70}$$

has solution x*. Let M be an optimistic estimate of f(x*), that is, $M \le f(x^*)$. Define $v(M, x) = [f(x) - M]^2 + |h(x)|^2$ and define the unconstrained problem

$$\text{minimize}\quad v(M, x). \tag{13.71}$$

Given $M_k \le f(x^*)$, a solution $x_{M_k}$ to the corresponding problem (13.71) is found, then $M_k$ is updated through

$$M_{k+1} = M_k + [v(M_k, x_{M_k})]^{1/2} \tag{13.72}$$

and the process repeated.

(a) Show that if $M = f(x^*)$, a solution to (13.71) is a solution to (13.70).

(b) Show that if $x_M$ is a solution to (13.71), then $f(x_M) \le f(x^*)$.

(c) Show that if $M_k \le f(x^*)$ then $M_{k+1}$ determined by (13.72) satisfies $M_{k+1} \le f(x^*)$.

(d) Show that $M_k \to f(x^*)$.

(e)  Find  the  Hessian  of  v(M,  x)  (with  respect  to  x*).  Show  that,  to  within  a  scale 
factor,  it  is  identical  to  that  associated  with  the  standard  penalty  function 
method. 

11. Let A be an m × n matrix of rank m. Prove the matrix identity

$$[I + A^T A]^{-1} = I - A^T [I + A A^T]^{-1} A$$

and discuss how it can be used in conjunction with the method of Sect. 13.4.

12.  Show  that  in  the  limit  of  large  c,  a  single  cycle  of  the  normalization  method  of 
Sect.  13.6  is  exactly  the  same  as  a  single  cycle  of  the  combined  penalty  function 
and  gradient  projection  method  of  Sect.  13.7. 

13. Suppose that at some step k of the combined penalty function and gradient projection method, the m × n matrix $\nabla h(x_k)$ is not of rank m. Show how the method can be continued by temporarily executing the Newton step over a subspace of dimension less than m.

14. For a problem with equality constraints, show that in the combined penalty function and gradient projection method the second step (the steepest descent step) can be replaced by a step in the direction of the negative projected gradient (projected onto $M_k$) without destroying the global convergence property and without changing the rate of convergence.

15. Develop a method that is analogous to that of Sect. 13.7, but which is a combination of penalty functions and the reduced gradient method. Establish that the rate of convergence of the method is identical to that of the reduced gradient method.

16. Extend the result of the Exact Penalty Theorem of Sect. 13.8 to inequalities. Write $g_j(x) \le 0$ in the form of an equality as $g_j(x) + y_j^2 = 0$ and show that the original theorem applies.

17. Develop a result analogous to that of the Exact Penalty Theorem of Sect. 13.8 for the penalty function

$$P(x) = \max\{0, g_1(x), g_2(x), \ldots, g_p(x), |h_1(x)|, |h_2(x)|, \ldots, |h_m(x)|\}.$$

18. Solve the problem

$$\begin{aligned} &\text{minimize} && x^2 + xy + y^2 - 2y \\ &\text{subject to} && x + y = 2 \end{aligned}$$

three ways analytically

(a) with the necessary conditions.

(b) with a quadratic penalty function.

(c) with an exact penalty function.
Chapter 14

Duality and Dual Methods


We first derive the duality theory for constrained optimization, which is based on our earlier zero-order optimality conditions and the Lagrangian relaxations. The variables of the dual are typically the Lagrange multipliers associated with the constraints in the primal problem, that is, the original constrained optimization problem.

Thus, dual methods are based on the viewpoint that it is the Lagrange multipliers which are the fundamental unknowns associated with a constrained problem; once these multipliers are known, determination of the solution point is simple (at least in some situations). Dual methods, therefore, do not attack the original constrained problem directly but instead attack an alternate problem, the dual problem, whose unknowns are the Lagrange multipliers of the first problem. For a problem with n variables and m equality constraints, dual methods thus work in the m-dimensional space of Lagrange multipliers. Because Lagrange multipliers measure sensitivities and hence often have meaningful intuitive interpretations as prices associated with constraint resources, searching for these multipliers is often, in the context of a given practical problem, as appealing as searching for the values of the original problem variables.

The  study  of  dual  methods,  and  more  particularly  the  introduction  of  the  dual 
problem,  precipitates  some  extensions  of  earlier  concepts.  One  interesting  feature  of 
this  chapter  is  the  calculation  of  the  Hessian  of  the  dual  problem  and  the  discovery 
of  a  dual  canonical  convergence  ratio  associated  with  a  constrained  problem  that 
governs  the  convergence  of  steepest  ascent  applied  to  the  dual. 

The convergence ratio theory leads to a popular method, the method of multipliers based on the augmented Lagrangian, in which the condition of the Hessian is significantly improved to facilitate faster convergence.

The alternating direction method of multipliers is based on an idea resembling that in the coordinate descent method. Here, the gradient of the dual is calculated approximately in a block coordinate fashion using primal variables. This method is particularly effective for large-scale optimization.

Cutting plane algorithms, exceedingly elementary in principle, develop a series of ever-improving approximating linear programs, whose solutions converge to the solution of the original problem. The methods differ only in the manner by which an improved approximating problem is constructed once a solution to the old approximation is known. The theory associated with these algorithms is, unfortunately, scant and their convergence properties are not particularly attractive. They are, however, often very easy to implement.


14.1  Global  Duality 

Duality in nonlinear programming takes its most elegant form when it is formulated globally in terms of sets and hyperplanes that touch those sets. This theory makes clear the role of Lagrange multipliers as defining hyperplanes which can be considered as dual to points in a vector space. The theory provides a symmetry between primal and dual problems and this symmetry can be considered as perfect for convex problems. For non-convex problems the "imperfection" is made clear by the duality gap which has a simple geometric interpretation. The global theory, which is presented in this section, serves as useful background when later we specialize to a local duality theory that can be used even without convexity and which is central to the understanding of the convergence of dual algorithms.

As a counterpoint to Sect. 11.9 where equality constraints were considered before inequality constraints, here we shall first consider a problem with inequality constraints. In particular, consider the problem

$$\begin{aligned} &\text{minimize} && f(x) \\ &\text{subject to} && g(x) \le 0 \\ & && x \in \Omega. \end{aligned} \tag{14.1}$$

$\Omega \subset E^n$ is a convex set, and the functions f and g are defined on $\Omega$. The function g is p-dimensional. The problem is not necessarily convex, but we assume that there is a feasible point. Recall that the primal function associated with (14.1) is defined for $z \in E^p$ as

$$\omega(z) = \inf\{f(x) : g(x) \le z,\ x \in \Omega\}, \tag{14.2}$$

defined by letting the right hand side of the inequality constraint take on arbitrary values. It is understood that (14.2) is defined on the set $D = \{z : g(x) \le z \text{ for some } x \in \Omega\}$.

If problem (14.1) has a solution x* with value $f^* = f(x^*)$, then $f^*$ is the point on the vertical axis in $E^{p+1}$ where the primal function passes through the axis. If (14.1) does not have a solution, then $f^* = \inf\{f(x) : g(x) \le 0,\ x \in \Omega\}$ is the intersection point.




The duality principle is derived from consideration of all hyperplanes that lie below the primal function. As illustrated in Fig. 14.1 the intercept with the vertical axis of such a hyperplane lies below (or at) the value $f^*$.

To express this property we define the dual function on the positive cone in $E^p$ as

$$\phi(\mu) = \inf\{f(x) + \mu^T g(x) : x \in \Omega\}. \tag{14.3}$$


Fig. 14.1 Hyperplane below $\omega(z)$

In general, $\phi$ may not be finite throughout the positive orthant $E^p_+$ but the region where it is finite is convex.

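Proposition 1. The dual function is concave on the region where it is finite.

Proof. Suppose $\mu_1$, $\mu_2$ are in the finite region, and let $0 \le \alpha \le 1$. Then

$$\begin{aligned} \phi(\alpha\mu_1 + (1 - \alpha)\mu_2) &= \inf\{f(x) + (\alpha\mu_1 + (1 - \alpha)\mu_2)^T g(x) : x \in \Omega\} \\ &\ge \inf\{\alpha f(x_1) + \alpha\mu_1^T g(x_1) : x_1 \in \Omega\} \\ &\quad + \inf\{(1 - \alpha) f(x_2) + (1 - \alpha)\mu_2^T g(x_2) : x_2 \in \Omega\} \\ &= \alpha\phi(\mu_1) + (1 - \alpha)\phi(\mu_2). \end{aligned}$$ ∎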

We define $\phi^* = \sup\{\phi(\mu) : \mu \ge 0\}$ where it is understood that the supremum is taken over the region where $\phi$ is finite. We can now state the weak form of global duality.

Weak Duality Proposition. $\phi^* \le f^*$.

Proof. For every $\mu \ge 0$ we have

$$\begin{aligned} \phi(\mu) &= \inf\{f(x) + \mu^T g(x) : x \in \Omega\} \\ &\le \inf\{f(x) + \mu^T g(x) : g(x) \le 0,\ x \in \Omega\} \\ &\le \inf\{f(x) : g(x) \le 0,\ x \in \Omega\} = f^*. \end{aligned}$$

Taking the supremum over the left hand side gives $\phi^* \le f^*$. ∎

Hence the dual function gives lower bounds on the optimal value $f^*$.
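A one-dimensional case makes the lower-bound property concrete. The sketch below uses a hypothetical problem chosen only for illustration: minimize x^2 subject to 1 - x <= 0, with the underlying set taken as the whole real line. For this problem the inner minimization defining the dual function has the closed-form minimizer x = mu/2, and the optimal value is 1 (attained at x = 1).

```python
def dual_function(mu):
    """phi(mu) = min_x { x^2 + mu*(1 - x) }, attained at x = mu/2."""
    x = mu / 2.0
    return x * x + mu * (1.0 - x)


F_STAR = 1.0  # optimal value of: minimize x^2 subject to 1 - x <= 0


def best_lower_bound(mu_grid):
    """Largest dual value on a grid of nonnegative multipliers, with its mu."""
    return max((dual_function(mu), mu) for mu in mu_grid)


if __name__ == "__main__":
    grid = [0.1 * k for k in range(51)]  # mu in [0, 5]
    # Weak duality: every dual value is a lower bound on the optimal value.
    assert all(dual_function(mu) <= F_STAR + 1e-12 for mu in grid)
    print(best_lower_bound(grid))
```

Because this illustrative problem is convex, the best lower bound on the grid coincides with the optimal value; in general weak duality only guarantees the inequality.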

This dual function has a strong geometric interpretation. Consider a (p + 1)-dimensional vector $(1, \mu) \in E^{p+1}$ with $\mu \ge 0$ and a constant c. The set of vectors (r, z) such that the inner product $(1, \mu)^T (r, z) = r + \mu^T z = c$ defines a hyperplane in $E^{p+1}$. Different values of c give different hyperplanes, all of which are parallel.

For a given $(1, \mu)$ we consider the lowest possible hyperplane of this form that just barely touches (supports) the region above the primal function of problem (14.1). Suppose $x_1$ defines the touching point with values $r = f(x_1)$ and $z = g(x_1)$. Then $c = f(x_1) + \mu^T g(x_1) = \phi(\mu)$.

The hyperplane intersects the vertical axis at a point of the form $(r_0, 0)$. This point also must satisfy $(1, \mu)^T (r_0, 0) = c = \phi(\mu)$. This gives $c = r_0$. Thus the intercept gives $\phi(\mu)$ directly. Thus the dual function at $\mu$ is equal to the intercept of the hyperplane defined by $\mu$ that just touches the epigraph of the primal function.


Fig.  14.2  The  highest  hyperplane 


Furthermore, this intercept (and dual function value) is maximized by the Lagrange multiplier which corresponds to the largest possible intercept, at a point no higher than the optimal value $f^*$. See Fig. 14.2.

By introducing convexity assumptions, the foregoing analysis can be strengthened to give the strong duality theorem, with no duality gap when the intercept is at $f^*$. See Fig. 14.3.

We shall state the result for the more general problem that includes equality constraints of the form h(x) = 0, as in Sect. 11.9. Specifically, we consider the problem

$$\begin{aligned} &\text{minimize} && f(x) \\ &\text{subject to} && h(x) = 0,\quad g(x) \le 0 \\ & && x \in \Omega \end{aligned} \tag{14.4}$$

where h is affine of dimension m, g is convex of dimension p, and $\Omega$ is a convex set. In this case the dual function is

$$\phi(\lambda, \mu) = \inf\{f(x) + \lambda^T h(x) + \mu^T g(x) : x \in \Omega\}.$$

And let

$$\phi^* = \sup\{\phi(\lambda, \mu) : \lambda \in E^m,\ \mu \in E^p,\ \mu \ge 0\}.$$


Fig.  14.3  The  strong  duality  theorem.  There  is  no  duality  gap 


Strong Duality Theorem. Suppose in the problem (14.4), h is affine and regular with respect to $\Omega$ and there is a point $x_1 \in \Omega$ with $h(x_1) = 0$ and $g(x_1) < 0$.

Suppose the problem has solution x* with value $f(x^*) = f^*$. Then for every $\lambda$ and $\mu \ge 0$ there holds

$$\phi(\lambda, \mu) \le f^*.$$

Furthermore, there are $\lambda$, $\mu \ge 0$ such that

$$\phi(\lambda, \mu) = f^*$$

and hence $\phi^* = f^*$. Moreover, the $\lambda$ and $\mu$ above are Lagrange multipliers for the original problem.

Proof. The proof follows almost immediately from the zero-order Lagrange theorem of Sect. 11.9. The Lagrange multipliers of that theorem give

$$f^* = \min\{f(x) + \lambda^T h(x) + \mu^T g(x) : x \in \Omega\} = \phi(\lambda, \mu) \le \phi^* \le f^*.$$

Equality must hold across the inequalities, which establishes the results. ∎

As a nice summary we can place the primal and dual problems together for the problem with inequality constraints.

$$\begin{array}{ll} \textbf{Primal} & \textbf{Dual} \\ f^* = \text{minimize } \omega(z) & \phi^* = \text{maximize } \phi(\mu) \\ \text{subject to } z \le 0 & \text{subject to } \mu \ge 0. \end{array}$$

Example 1 (Quadratic Program). Consider the problem

$$\begin{aligned} &\text{minimize} && \tfrac{1}{2} x^T Q x \\ &\text{subject to} && Bx - b \le 0. \end{aligned} \tag{14.5}$$

The dual function is

$$\phi(\mu) = \min_{x} \left\{ \tfrac{1}{2} x^T Q x + \mu^T (Bx - b) \right\}.$$

This gives the necessary conditions

$$Qx + B^T \mu = 0$$

and hence $x = -Q^{-1} B^T \mu$. Substituting this into $\phi(\mu)$ gives

$$\phi(\mu) = -\tfrac{1}{2} \mu^T B Q^{-1} B^T \mu - \mu^T b.$$

Hence the dual problem is

$$\begin{aligned} &\text{maximize} && -\tfrac{1}{2} \mu^T B Q^{-1} B^T \mu - \mu^T b \\ &\text{subject to} && \mu \ge 0, \end{aligned} \tag{14.6}$$

which is also a quadratic programming problem. If this problem is solved for $\mu$, that $\mu$ will be the Lagrange multiplier for the primal problem (14.5).

Note that the first-order conditions for the dual problem (14.6) imply

$$\mu^T [-B Q^{-1} B^T \mu - b] = 0,$$

which by substituting the formula for x is equivalent to

$$\mu^T [Bx - b] = 0.$$

This is the complementary slackness condition for the original (primal) problem (14.5).
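The dual quadratic program is simple enough to solve numerically on a small instance. The following is a minimal sketch under stated assumptions, not a method taken from the text: Q is taken to be diagonal and positive definite so that its inverse is trivial, the data are hypothetical, and the dual is maximized by projected gradient ascent with a fixed step (a technique chosen here only for brevity). The primal point is then recovered from x = -Q^{-1} B^T mu.

```python
def dual_matrix(q_diag, B):
    """M = B Q^{-1} B^T for a diagonal, positive definite Q."""
    p, n = len(B), len(q_diag)
    return [[sum(B[i][k] * B[j][k] / q_diag[k] for k in range(n))
             for j in range(p)] for i in range(p)]


def dual_value(M, b, mu):
    """phi(mu) = -1/2 mu^T M mu - mu^T b, the dual objective."""
    p = len(mu)
    quad = sum(mu[i] * M[i][j] * mu[j] for i in range(p) for j in range(p))
    return -0.5 * quad - sum(mu[i] * b[i] for i in range(p))


def solve_dual(q_diag, B, b, iters=500):
    """Projected gradient ascent on the dual; returns (mu, x)."""
    p, n = len(B), len(q_diag)
    M = dual_matrix(q_diag, B)
    step = 1.0 / sum(M[i][i] for i in range(p))  # trace(M) bounds the top eigenvalue
    mu = [0.0] * p
    for _ in range(iters):
        grad = [-sum(M[i][j] * mu[j] for j in range(p)) - b[i] for i in range(p)]
        mu = [max(0.0, mu[i] + step * grad[i]) for i in range(p)]
    # Recover the primal point from x = -Q^{-1} B^T mu.
    x = [-sum(B[i][k] * mu[i] for i in range(p)) / q_diag[k] for k in range(n)]
    return mu, x


# Hypothetical data: minimize x1^2 + x2^2/2 subject to x1 + x2 >= 1 and x1 <= 5.
Q_DIAG = [2.0, 1.0]
B = [[-1.0, -1.0], [1.0, 0.0]]
b = [-1.0, 5.0]

if __name__ == "__main__":
    mu, x = solve_dual(Q_DIAG, B, b)
    primal = 0.5 * sum(q * xi * xi for q, xi in zip(Q_DIAG, x))
    print(mu, x, primal, dual_value(dual_matrix(Q_DIAG, B), b, mu))
```

On this instance the multiplier of the inactive constraint x1 <= 5 stays at zero, the primal and dual values agree, and the complementary slackness product vanishes.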

Example 2 (Integer Solutions). Duality gaps may arise if the objective function or the constraint functions are not convex. A gap may also arise if the underlying set is not convex. This is characteristic, for example, of problems in which the components of the solution vector are constrained to be integers. For instance, consider the problem

$$\begin{aligned} &\text{minimize} && x_1^2 + 2x_2^2 \\ &\text{subject to} && x_1 + x_2 \ge 1/2 \\ & && x_1, x_2 \text{ nonnegative integers.} \end{aligned}$$

It is clear that the solution is $x_1 = 1$, $x_2 = 0$, with objective value $f^* = 1$. To put this problem in the standard form we have discussed, we write the constraint as

$$-x_1 - x_2 + 1/2 \le z, \quad \text{where } z = 0.$$

The primal function $\omega(z)$ is equal to 0 for $z \ge 1/2$ since then $x_1 = x_2 = 0$ is feasible. The entire primal function has steps as z steps negatively integer by integer, as shown in Fig. 14.4.


Fig.  14.4  Duality  for  an  integer  problem 


The dual function is

$$\phi(\mu) = \min\{x_1^2 + 2x_2^2 - \mu(x_1 + x_2 - 1/2)\}$$

where the minimum is taken with respect to the integer constraint. Analytically, the solution for small values of $\mu$ is

$$\phi(\mu) = \begin{cases} \mu/2 & \text{for } 0 \le \mu \le 1, \\ 1 - \mu/2 & \text{for } 1 \le \mu \le 2, \\ \ \vdots & \text{and more.} \end{cases}$$

The maximum value of $\phi(\mu)$ is the maximum intercept of the corresponding hyperplanes (lines, in this case) with the vertical axis. This occurs for $\mu = 1$ with a corresponding value of $\phi^* = \phi(1) = 1/2$. We have $\phi^* < f^*$ and the difference $f^* - \phi^* = 1/2$ is the duality gap.
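The gap in this example can be reproduced by direct enumeration. The sketch below is illustrative only: the function names are hypothetical, and the integer points are truncated to the box [0, max_int]^2, an assumption that is harmless for the small multipliers considered because the Lagrangian grows quadratically in each variable.

```python
from itertools import product


def objective(x1, x2):
    return x1 ** 2 + 2 * x2 ** 2


def dual_function(mu, max_int=10):
    """phi(mu): minimum of the Lagrangian over integer points in [0, max_int]^2."""
    return min(objective(x1, x2) - mu * (x1 + x2 - 0.5)
               for x1, x2 in product(range(max_int + 1), repeat=2))


def primal_value(max_int=10):
    """f*: minimum of the objective over the feasible integer points."""
    return min(objective(x1, x2)
               for x1, x2 in product(range(max_int + 1), repeat=2)
               if x1 + x2 >= 0.5)


if __name__ == "__main__":
    grid = [k / 100 for k in range(301)]  # mu in [0, 3]
    phi_star = max(dual_function(mu) for mu in grid)
    print(primal_value(), phi_star, primal_value() - phi_star)
```

The enumeration recovers the two linear pieces of the dual function given above and a best dual value of 1/2 against a primal value of 1.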


14.2  Local  Duality 

In practice the mechanics of duality are frequently carried out locally, by setting derivatives to zero, or moving in the direction of a gradient. For these operations the beautiful global theory can in large measure be replaced by a weaker but often more useful local theory. This theory requires a minimum of convexity assumptions defined locally. We present such a theory in this section, since it is in keeping with the spirit of the earlier chapters and is perhaps the simplest way to develop computationally useful duality results.

As often done before for convenience, we again consider nonlinear programming problems of the form

$$\begin{aligned} &\text{minimize} && f(x) \\ &\text{subject to} && h(x) = 0, \end{aligned} \tag{14.7}$$

where $x \in E^n$, $h(x) \in E^m$ and $f, h \in C^2$. Global convexity is not assumed here. Everything we do can be easily extended to problems having inequality as well as equality constraints, for the price of a somewhat more involved notation.

We focus attention on a local solution x* of (14.7). Assuming that x* is a regular point of the constraints, then, as we know, there will be a corresponding Lagrange multiplier (row) vector $\lambda^*$ such that

$$\nabla f(x^*) + (\lambda^*)^T \nabla h(x^*) = 0, \tag{14.8}$$

and the Hessian of the Lagrangian

$$L(x^*) = F(x^*) + (\lambda^*)^T H(x^*) \tag{14.9}$$

must be positive semidefinite on the tangent subspace

$$M = \{x : \nabla h(x^*) x = 0\}.$$

At this point we introduce the special local convexity assumption necessary for the development of the local duality theory. Specifically, we assume that the Hessian L(x*) is positive definite. Of course, it should be emphasized that by this we mean L(x*) is positive definite on the whole space $E^n$, not just on the subspace M. The assumption guarantees that the Lagrangian $l(x) = f(x) + (\lambda^*)^T h(x)$ is locally convex at x*.

With this assumption, the point x* is not only a local solution to the constrained problem (14.7); it is also a local solution to the unconstrained problem

$$\text{minimize}\quad f(x) + (\lambda^*)^T h(x), \tag{14.10}$$

since it satisfies the first- and second-order sufficiency conditions for a local minimum point. Furthermore, for any $\lambda$ sufficiently close to $\lambda^*$ the function $f(x) + \lambda^T h(x)$ will have a local minimum point at a point x near x*. This follows by noting that, by the Implicit Function Theorem, the equation

$$\nabla f(x) + \lambda^T \nabla h(x) = 0 \tag{14.11}$$

has a solution x near x* when $\lambda$ is near $\lambda^*$, because L* is nonsingular; and by the fact that, at this solution x, the Hessian $F(x) + \lambda^T H(x)$ is positive definite. Thus locally there is a unique correspondence between $\lambda$ and x through solution of the unconstrained problem

$$\text{minimize}\quad f(x) + \lambda^T h(x). \tag{14.12}$$

Furthermore, this correspondence is continuously differentiable.

Near $\lambda^*$ we define the dual function $\phi$ by the equation

$$\phi(\lambda) = \operatorname{minimum}\, [f(x) + \lambda^T h(x)], \tag{14.13}$$

where here it is understood that the minimum is taken locally with respect to x near x*. We are then able to show (and will do so below) that locally the original constrained problem (14.7) is equivalent to unconstrained local maximization of the dual function $\phi$ with respect to $\lambda$. Hence we establish an equivalence between a constrained problem in x and an unconstrained problem in $\lambda$.

To establish the duality relation we must prove two important lemmas. In the statements below we denote by $x(\lambda)$ the unique solution to (14.12) in the neighborhood of x*.

Lemma 1. The dual function $\phi$ has gradient

$$\nabla\phi(\lambda) = h(x(\lambda))^T \tag{14.14}$$

Proof. We have explicitly, from (14.13),

$$\phi(\lambda) = f(x(\lambda)) + \lambda^T h(x(\lambda)).$$

Thus

$$\nabla\phi(\lambda) = [\nabla f(x(\lambda)) + \lambda^T \nabla h(x(\lambda))] \nabla x(\lambda) + h(x(\lambda))^T.$$

Since the first term on the right vanishes by definition of $x(\lambda)$, we obtain (14.14). ∎

Lemma 1 is of extreme practical importance, since it shows that the gradient of the dual function is simple to calculate. Once the dual function itself is evaluated, by minimization with respect to x, the corresponding $h(x)^T$, which is the gradient, can be evaluated without further calculation.

The  Hessian  of  the  dual  function  can  be  expressed  in  terms  of  the  Hessian  of  the 
Lagrangian.  We  use  the  notation  L(x,  A)  =  F(x)  +  A  H(x),  explicitly  indicating  the 
dependence  on  A.  (We  continue  to  use  L(x*)  when  A  =  A*  is  understood.)  We  then 
have  the  following  lemma. 

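Lemma 2. The Hessian of the dual function is

\[ \boldsymbol{\Phi}(\boldsymbol{\lambda}) = -\nabla\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}))\,\mathbf{L}^{-1}(\mathbf{x}(\boldsymbol{\lambda}), \boldsymbol{\lambda})\,\nabla\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}))^T. \tag{14.15} \]

Proof. The Hessian is the derivative of the gradient. Thus, by Lemma 1,

\[ \boldsymbol{\Phi}(\boldsymbol{\lambda}) = \nabla\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}))\nabla\mathbf{x}(\boldsymbol{\lambda}). \tag{14.16} \]

By definition we have

\[ \nabla f(\mathbf{x}(\boldsymbol{\lambda})) + \boldsymbol{\lambda}^T \nabla\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda})) = \mathbf{0}, \]

and differentiating this with respect to $\boldsymbol{\lambda}$ we obtain

\[ \mathbf{L}(\mathbf{x}(\boldsymbol{\lambda}), \boldsymbol{\lambda})\nabla\mathbf{x}(\boldsymbol{\lambda}) + \nabla\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}))^T = \mathbf{0}. \]

Solving for $\nabla\mathbf{x}(\boldsymbol{\lambda})$ and substituting in (14.16) we obtain (14.15). ∎

Since $\mathbf{L}^{-1}(\mathbf{x}(\boldsymbol{\lambda}))$ is positive definite, and since $\nabla\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}))$ is of full rank near $\mathbf{x}^*$,
we have as an immediate consequence of Lemma 2 that the $m \times m$ Hessian of $\phi$ is
negative definite. As might be expected, this Hessian plays a dominant role in the
analysis of dual methods.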
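
As an illustration of Lemmas 1 and 2, the two formulas can be worked out in closed form for an equality-constrained quadratic program. The data $\mathbf{Q}$, $\mathbf{A}$, and $\mathbf{b}$ below are assumptions introduced only for this sketch ($\mathbf{Q}$ symmetric positive definite, $\mathbf{A}$ of full row rank); they do not appear in the text. Here $\mathbf{L} = \mathbf{Q}$ and $\nabla\mathbf{h} = \mathbf{A}$.

```latex
\begin{aligned}
&\text{minimize } \tfrac12\,\mathbf{x}^T\mathbf{Q}\mathbf{x}
  \quad \text{subject to } \mathbf{h}(\mathbf{x}) = \mathbf{A}\mathbf{x} - \mathbf{b} = \mathbf{0}, \\
&\mathbf{x}(\boldsymbol{\lambda}) = -\mathbf{Q}^{-1}\mathbf{A}^T\boldsymbol{\lambda}
  \quad \text{(from } \mathbf{Q}\mathbf{x} + \mathbf{A}^T\boldsymbol{\lambda} = \mathbf{0}\text{)}, \\
&\phi(\boldsymbol{\lambda}) = -\tfrac12\,\boldsymbol{\lambda}^T\mathbf{A}\mathbf{Q}^{-1}\mathbf{A}^T\boldsymbol{\lambda}
  - \boldsymbol{\lambda}^T\mathbf{b}, \\
&\nabla\phi(\boldsymbol{\lambda}) = (-\mathbf{A}\mathbf{Q}^{-1}\mathbf{A}^T\boldsymbol{\lambda} - \mathbf{b})^T
  = (\mathbf{A}\mathbf{x}(\boldsymbol{\lambda}) - \mathbf{b})^T
  = \mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}))^T, \\
&\boldsymbol{\Phi}(\boldsymbol{\lambda}) = -\mathbf{A}\mathbf{Q}^{-1}\mathbf{A}^T .
\end{aligned}
```

The gradient agrees with (14.14) and the Hessian with (14.15); the Hessian is negative definite because $\mathbf{Q}^{-1}$ is positive definite and $\mathbf{A}$ has full row rank.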

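Local Duality Theorem. Suppose that the problem

\[
\begin{aligned}
\text{minimize} \quad & f(\mathbf{x}) \\
\text{subject to} \quad & \mathbf{h}(\mathbf{x}) = \mathbf{0}
\end{aligned}
\tag{14.17}
\]

has a local solution at $\mathbf{x}^*$ with corresponding value $r^*$ and Lagrange multiplier $\boldsymbol{\lambda}^*$. Suppose
also that $\mathbf{x}^*$ is a regular point of the constraints and that the corresponding Hessian of the
Lagrangian $\mathbf{L}^* = \mathbf{L}(\mathbf{x}^*)$ is positive definite. Then the dual problem

\[ \text{maximize} \quad \phi(\boldsymbol{\lambda}) \tag{14.18} \]

has a local solution at $\boldsymbol{\lambda}^*$ with corresponding value $r^*$ and $\mathbf{x}^*$ as the point corresponding to
$\boldsymbol{\lambda}^*$ in the definition of $\phi$.

Proof. It is clear that $\mathbf{x}^*$ corresponds to $\boldsymbol{\lambda}^*$ in the definition of $\phi$. Now at $\boldsymbol{\lambda}^*$ we have
by Lemma 1

\[ \nabla\phi(\boldsymbol{\lambda}^*) = \mathbf{h}(\mathbf{x}^*)^T = \mathbf{0}, \]

and by Lemma 2 the Hessian of $\phi$ is negative definite. Thus $\boldsymbol{\lambda}^*$ satisfies the first- and
second-order sufficiency conditions for an unconstrained maximum point of $\phi$. The
corresponding value $\phi(\boldsymbol{\lambda}^*)$ is found from the definition of $\phi$ to be $r^*$. ∎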

Example 1. Consider the problem in two variables

\[
\begin{aligned}
\text{minimize} \quad & -xy \\
\text{subject to} \quad & (x - 3)^2 + y^2 = 5.
\end{aligned}
\]

The first-order necessary conditions are

\[
\begin{aligned}
-y + (2x - 6)\lambda &= 0 \\
-x + 2y\lambda &= 0
\end{aligned}
\]

together with the constraint. These equations have a solution at

\[ x = 4, \quad y = 2, \quad \lambda = 1. \]

The Hessian of the corresponding Lagrangian is

\[ \mathbf{L} = \begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}. \]

Since this is positive definite, we conclude that the solution obtained is a local
minimum. (It can be shown, in fact, that it is the global solution.)

Since $\mathbf{L}$ is positive definite, we can apply the local duality theory near this
solution. We define

\[ \phi(\lambda) = \min\{-xy + \lambda[(x - 3)^2 + y^2 - 5]\}, \]

which leads to

\[ \phi(\lambda) = \frac{4\lambda + 4\lambda^3 - 80\lambda^5}{(4\lambda^2 - 1)^2}, \]

valid for $\lambda > \tfrac12$. It can be verified that $\phi$ has a local maximum at $\lambda = 1$.
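
The claims of Example 1 can be checked numerically. The sketch below is an illustration, not part of the original text: the function names are hypothetical, and the inner minimizer is obtained by solving the two stationarity equations by hand, which gives $x = 12\lambda^2/(4\lambda^2 - 1)$ and $y = 6\lambda/(4\lambda^2 - 1)$ for $\lambda > \tfrac12$. It compares the closed-form dual function with direct evaluation and evaluates the gradient and Hessian formulas of Lemmas 1 and 2.

```python
def minimizer(lam):
    """Stationary point (x, y) of -x*y + lam*((x-3)**2 + y**2 - 5); a minimum for lam > 1/2."""
    d = 4.0 * lam ** 2 - 1.0
    return 12.0 * lam ** 2 / d, 6.0 * lam / d


def h(x, y):
    """Constraint function h(x, y) = (x - 3)^2 + y^2 - 5."""
    return (x - 3.0) ** 2 + y ** 2 - 5.0


def phi(lam):
    """Dual function evaluated through the inner minimization."""
    x, y = minimizer(lam)
    return -x * y + lam * h(x, y)


def phi_closed_form(lam):
    """Closed form quoted in Example 1."""
    return (4 * lam + 4 * lam ** 3 - 80 * lam ** 5) / (4 * lam ** 2 - 1) ** 2


def dual_gradient(lam):
    """Lemma 1: the gradient of phi equals h(x(lam))."""
    return h(*minimizer(lam))


def dual_hessian(lam):
    """Lemma 2: -grad_h L^{-1} grad_h^T with L = [[2*lam, -1], [-1, 2*lam]]."""
    x, y = minimizer(lam)
    g1, g2 = 2.0 * (x - 3.0), 2.0 * y
    det = 4.0 * lam ** 2 - 1.0
    # L^{-1} = (1/det) * [[2*lam, 1], [1, 2*lam]]
    return -(2.0 * lam * g1 * g1 + 2.0 * g1 * g2 + 2.0 * lam * g2 * g2) / det


if __name__ == "__main__":
    for lam in (0.8, 1.0, 1.5):
        print(lam, phi(lam), phi_closed_form(lam), dual_gradient(lam), dual_hessian(lam))
```

At $\lambda = 1$ the gradient vanishes and the Hessian is negative, which is consistent with a local maximum of the dual function there.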


Inequality  Constraints 

For  problems  having  inequality  constraints  as  well  as  equality  constraints  the  above 
development  requires  only  minor  modification.  Consider  the  problem 

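\[
\begin{aligned}
\text{minimize} \quad & f(\mathbf{x}) \\
\text{subject to} \quad & \mathbf{h}(\mathbf{x}) = \mathbf{0} \\
& \mathbf{g}(\mathbf{x}) \le \mathbf{0},
\end{aligned}
\tag{14.19}
\]

where $\mathbf{g}(\mathbf{x}) \in E^p$, $\mathbf{g} \in C^2$ and everything else is as before. Suppose $\mathbf{x}^*$ is a local
solution of (14.19) and is a regular point of the constraints. Then, as we know, there
are Lagrange multipliers $\boldsymbol{\lambda}^*$ and $\boldsymbol{\mu}^* \ge \mathbf{0}$ such that

\[ \nabla f(\mathbf{x}^*) + (\boldsymbol{\lambda}^*)^T \nabla\mathbf{h}(\mathbf{x}^*) + (\boldsymbol{\mu}^*)^T \nabla\mathbf{g}(\mathbf{x}^*) = \mathbf{0} \tag{14.20} \]

\[ (\boldsymbol{\mu}^*)^T \mathbf{g}(\mathbf{x}^*) = 0. \tag{14.21} \]

We impose the local convexity assumptions that the Hessian of the Lagrangian

\[ \mathbf{L}(\mathbf{x}^*) = \mathbf{F}(\mathbf{x}^*) + (\boldsymbol{\lambda}^*)^T \mathbf{H}(\mathbf{x}^*) + (\boldsymbol{\mu}^*)^T \mathbf{G}(\mathbf{x}^*) \tag{14.22} \]

is positive definite (on the whole space).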

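For $\boldsymbol{\lambda}$ and $\boldsymbol{\mu} \ge \mathbf{0}$ near $\boldsymbol{\lambda}^*$ and $\boldsymbol{\mu}^*$ we define the dual function

\[ \phi(\boldsymbol{\lambda}, \boldsymbol{\mu}) = \min\,[f(\mathbf{x}) + \boldsymbol{\lambda}^T \mathbf{h}(\mathbf{x}) + \boldsymbol{\mu}^T \mathbf{g}(\mathbf{x})], \tag{14.23} \]

where the minimum is taken locally near $\mathbf{x}^*$. Then, it is easy to show, paralleling the
development above for equality constraints, that $\phi$ achieves a local maximum with
respect to $\boldsymbol{\lambda}$, $\boldsymbol{\mu} \ge \mathbf{0}$ at $\boldsymbol{\lambda}^*$, $\boldsymbol{\mu}^*$.


Partial Duality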

It is not necessary to include the Lagrange multipliers of all the constraints of a
problem in the definition of the dual function. In general, if the local convexity
assumption holds, local duality can be defined with respect to any subset of functional
constraints. Thus, for example, in the problem

\[
\begin{aligned}
\text{minimize} \quad & f(\mathbf{x}) \\
\text{subject to} \quad & \mathbf{h}(\mathbf{x}) = \mathbf{0} \\
& \mathbf{g}(\mathbf{x}) \le \mathbf{0},
\end{aligned}
\tag{14.24}
\]

we might define the dual function with respect to only the equality constraints. In
this case we would define

\[ \phi(\boldsymbol{\lambda}) = \min_{\mathbf{g}(\mathbf{x}) \le \mathbf{0}} \{f(\mathbf{x}) + \boldsymbol{\lambda}^T \mathbf{h}(\mathbf{x})\}, \tag{14.25} \]

where the minimum is taken locally near the solution $\mathbf{x}^*$ but constrained by the
remaining constraints $\mathbf{g}(\mathbf{x}) \le \mathbf{0}$. Again, the dual function defined in this way will
achieve a local maximum at the optimal Lagrange multiplier $\boldsymbol{\lambda}^*$. The partial dual is
especially useful when the constraints $\mathbf{g}(\mathbf{x}) \le \mathbf{0}$ are simple, such as $\mathbf{x} \le \mathbf{0}$ or membership in a box.


14.3  Canonical  Convergence  Rate  of  Dual  Steepest  Ascent 

Constrained problems satisfying the local convexity assumption can be solved by
solving the associated unconstrained dual problem, and any of the standard
algorithms discussed in Chaps. 7 through 10 can be used for this purpose. Of course,
the method that suggests itself immediately is the method of steepest ascent. It can
be implemented by noting that, according to Lemma 1, Sect. 14.2, the gradient
of $\phi$ is available almost without cost once $\phi$ itself is evaluated. Without some
special properties, however, the method as a whole can be extremely costly to execute,
since every evaluation of $\phi$ requires the solution of an unconstrained problem in the
unknown $\mathbf{x}$. Nevertheless, as shown in the next section, many important problems
do have a structure which is suited to this approach.

The method of steepest ascent, and other gradient-based algorithms, when applied
to the dual problem will have a convergence rate governed by the eigenvalue
structure of the Hessian of the dual function $\phi$. At the Lagrange multiplier $\boldsymbol{\lambda}^*$
corresponding to a solution $\mathbf{x}^*$ this Hessian is (according to Lemma 2, Sect. 14.2)

\[ \boldsymbol{\Phi} = -\nabla\mathbf{h}(\mathbf{x}^*)(\mathbf{L}^*)^{-1}\nabla\mathbf{h}(\mathbf{x}^*)^T. \]

This expression shows that $\boldsymbol{\Phi}$ is in some sense a restriction of the matrix $(\mathbf{L}^*)^{-1}$ to the
subspace spanned by the gradients of the constraint functions, which is the
orthogonal complement of the tangent subspace $M$. This restriction is not the orthogonal
restriction of $(\mathbf{L}^*)^{-1}$ onto the complement of $M$ since the particular representation of
the constraints affects the structure of the Hessian. We see, however, that while the
convergence of primal methods is governed by the restriction of $\mathbf{L}^*$ to $M$, the
convergence of dual methods is governed by a restriction of $(\mathbf{L}^*)^{-1}$ to the orthogonal
complement of $M$.

The dual canonical convergence rate associated with the original constrained
problem, which is the rate of convergence of steepest ascent applied to the dual,
is $(B - b)^2/(B + b)^2$ where $b$ and $B$ are, respectively, the smallest and largest
eigenvalues of

\[ -\boldsymbol{\Phi} = \nabla\mathbf{h}(\mathbf{x}^*)(\mathbf{L}^*)^{-1}\nabla\mathbf{h}(\mathbf{x}^*)^T. \]

For locally convex programming problems, this rate is as important as the primal
canonical rate.


Scaling 

We  conclude  this  section  by  pointing  out  a  kind  of  complementarity  that  exists 
between  the  primal  and  dual  rates.  Suppose  one  calculates  the  primal  and  dual 
canonical  rates  associated  with  the  locally  convex  constrained  problem 

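\[
\begin{aligned}
\text{minimize} \quad & f(\mathbf{x}) \\
\text{subject to} \quad & \mathbf{h}(\mathbf{x}) = \mathbf{0}.
\end{aligned}
\]

If a change of primal variables $\mathbf{x}$ is introduced, the primal rate will in general change
but the dual rate will not. On the other hand, if the constraints are transformed (by
replacing them by $\mathbf{T}\mathbf{h}(\mathbf{x}) = \mathbf{0}$ where $\mathbf{T}$ is a nonsingular $m \times m$ matrix), the dual rate
will change but the primal rate will not.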
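
The scaling behavior of the dual rate can be illustrated on a small made-up instance. In the sketch below, the diagonal Hessian `l_diag`, the constraint Jacobian `grad_h`, the diagonal change of variables `s`, and the matrix `T` are all hypothetical data chosen for the illustration. Under $\mathbf{x} = \mathbf{S}\mathbf{z}$ the Hessian becomes $\mathbf{S}^T\mathbf{L}^*\mathbf{S}$ and the Jacobian becomes $\nabla\mathbf{h}\,\mathbf{S}$; under $\mathbf{T}\mathbf{h}(\mathbf{x}) = \mathbf{0}$ the Jacobian becomes $\mathbf{T}\nabla\mathbf{h}$ while the Hessian of the Lagrangian at the solution is unchanged.

```python
import math


def dual_matrix(grad_h, l_diag):
    """Return -Phi = grad_h * L^{-1} * grad_h^T for a diagonal Hessian L."""
    m, n = len(grad_h), len(l_diag)
    return [[sum(grad_h[i][k] * grad_h[j][k] / l_diag[k] for k in range(n))
             for j in range(m)] for i in range(m)]


def eig_sym_2x2(a):
    """Eigenvalues (smallest, largest) of a symmetric 2x2 matrix."""
    mean = 0.5 * (a[0][0] + a[1][1])
    radius = math.hypot(0.5 * (a[0][0] - a[1][1]), a[0][1])
    return mean - radius, mean + radius


def canonical_rate(a):
    """Dual canonical rate ((B - b) / (B + b))^2 from the eigenvalues of -Phi."""
    b, B = eig_sym_2x2(a)
    return ((B - b) / (B + b)) ** 2


# Hypothetical data: three variables, two equality constraints.
l_diag = [1.0, 2.0, 4.0]
grad_h = [[1.0, 1.0, 0.0], [0.0, 1.0, 2.0]]
base = canonical_rate(dual_matrix(grad_h, l_diag))

# Change of primal variables x = S z with S diagonal.
s = [3.0, 0.5, 10.0]
l_scaled = [s[k] ** 2 * l_diag[k] for k in range(3)]
grad_scaled = [[row[k] * s[k] for k in range(3)] for row in grad_h]
primal_scaled = canonical_rate(dual_matrix(grad_scaled, l_scaled))

# Constraint transformation T h(x) = 0.
T = [[1.0, 0.0], [0.0, 5.0]]
grad_t = [[sum(T[i][j] * grad_h[j][k] for j in range(2)) for k in range(3)]
          for i in range(2)]
constraint_scaled = canonical_rate(dual_matrix(grad_t, l_diag))

print(base, primal_scaled, constraint_scaled)
```

For this data the dual rate is unchanged by the change of primal variables and becomes markedly worse when the second constraint is multiplied by 5.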


14.4  Separable  Problems  and  Their  Duals 

A structure that arises frequently in mathematical programming applications is that
of the separable problem:

\[ \text{minimize} \quad \sum_{i=1}^{q} f_i(\mathbf{x}_i) \tag{14.26} \]

\[ \text{subject to} \quad \sum_{i=1}^{q} \mathbf{h}_i(\mathbf{x}_i) = \mathbf{0} \tag{14.27} \]

\[ \sum_{i=1}^{q} \mathbf{g}_i(\mathbf{x}_i) \le \mathbf{0}. \tag{14.28} \]

In this formulation the components of the $n$-vector $\mathbf{x}$ are partitioned into $q$ disjoint
groups, $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_q)$ where the groups may or may not have the same
number of components. Both the objective function and the constraints separate into
sums of functions of the individual groups. For each $i$, the functions $f_i$, $\mathbf{h}_i$, and $\mathbf{g}_i$ are
twice continuously differentiable functions of dimensions 1, $m$, and $p$, respectively.

Example 1. Suppose that we have a fixed budget of, say, $A$ dollars that may be
allocated among $n$ activities. If $x_i$ dollars is allocated to the $i$th activity, then there
will be a benefit (measured in some units) of $f_i(x_i)$. To obtain the maximum benefit
within our budget, we solve the separable problem

\[
\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{n} f_i(x_i) \\
\text{subject to} \quad & \sum_{i=1}^{n} x_i \le A \\
& x_i \ge 0.
\end{aligned}
\tag{14.29}
\]

In the example $\mathbf{x}$ is partitioned into its individual components.

Example 2. Problems involving a series of decisions made at distinct times are often
separable. For illustration, consider the problem of scheduling water release through
a dam to produce as much electric power as possible over a given time interval
while satisfying constraints on acceptable water levels. A discrete-time model of
this problem is to

\[
\begin{aligned}
\text{maximize} \quad & \sum_{k=1}^{N} f(y(k), u(k)) \\
\text{subject to} \quad & y(k) = y(k-1) - u(k) + s(k), \quad k = 1, \ldots, N \\
& c \le y(k) \le d, \quad k = 1, \ldots, N \\
& 0 \le u(k), \quad k = 1, \ldots, N.
\end{aligned}
\]

Here $y(k)$ represents the water volume behind the dam at the end of period $k$, $u(k)$
represents the volume flow through the dam during period $k$, and $s(k)$ is the volume
flowing into the lake behind the dam during period $k$ from upper streams. The
function $f$ gives the power generation, and $c$ and $d$ are bounds on lake volume. The
initial volume $y(0)$ is given.

In this example we consider $\mathbf{x}$ as the $2N$-dimensional vector of unknowns
$y(k)$, $u(k)$, $k = 1, 2, \ldots, N$. This vector is partitioned into the pairs $\mathbf{x}_k = (y(k), u(k))$.
The objective function is then clearly in separable form. The constraints can be
viewed as being in the form (14.27) with $\mathbf{h}_k(\mathbf{x}_k)$ having dimension $N$ and such that
$\mathbf{h}_k(\mathbf{x}_k)$ is identically zero except in the $k$ and $k + 1$ components.


Decomposition

Separable problems are ideally suited to dual methods, because the required
unconstrained minimization decomposes into small subproblems. To see this we recall
that the generally most difficult aspect of a dual method is evaluation of the
dual function. For a separable problem, if we associate $\boldsymbol{\lambda}$ with the equality
constraints (14.27) and $\boldsymbol{\mu} \ge \mathbf{0}$ with the inequality constraints (14.28), the required dual
function is

\[ \phi(\boldsymbol{\lambda}, \boldsymbol{\mu}) = \min \sum_{i=1}^{q} \{ f_i(\mathbf{x}_i) + \boldsymbol{\lambda}^T \mathbf{h}_i(\mathbf{x}_i) + \boldsymbol{\mu}^T \mathbf{g}_i(\mathbf{x}_i) \}. \]

This minimization problem decomposes into the $q$ separate problems

\[ \min_{\mathbf{x}_i} \; f_i(\mathbf{x}_i) + \boldsymbol{\lambda}^T \mathbf{h}_i(\mathbf{x}_i) + \boldsymbol{\mu}^T \mathbf{g}_i(\mathbf{x}_i). \]

The solution of these subproblems can usually be accomplished relatively
efficiently, since they are of smaller dimension than the original problem.

Example 3. In Example 1 using duality with respect to the budget constraint, the $i$th
subproblem becomes, for $\mu \ge 0$,

\[ \max_{x_i \ge 0} \; f_i(x_i) - \mu x_i, \]

which is only a one-dimensional problem. It can be interpreted as setting a benefit
value $\mu$ for dollars and then maximizing total benefit from activity $i$, accounting for
the dollar expenditure.
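
The decomposition in Example 3 can be sketched in a few lines. The benefit functions $f_i(x) = a_i\sqrt{x}$ used below are an assumption made only for this illustration (the text leaves $f_i$ unspecified), as are the function names. For this choice each subproblem has the closed-form solution $x_i(\mu) = (a_i/2\mu)^2$, and the price $\mu$ is adjusted by bisection, rather than by steepest ascent, until total spending equals the budget.

```python
import math


def best_spend(a_i, mu):
    """Solve the i-th subproblem max_{x >= 0} a_i*sqrt(x) - mu*x in closed form."""
    return (a_i / (2.0 * mu)) ** 2


def solve_budget(a, budget, lo=1e-6, hi=1e6, iters=200):
    """Bisect on the price mu until total spending matches the budget."""
    mu = math.sqrt(lo * hi)
    for _ in range(iters):
        mu = math.sqrt(lo * hi)          # geometric midpoint: mu spans many decades
        spend = sum(best_spend(a_i, mu) for a_i in a)
        if spend > budget:
            lo = mu                      # price too low, activities overspend
        else:
            hi = mu                      # price too high, budget not exhausted
    return mu, [best_spend(a_i, mu) for a_i in a]


if __name__ == "__main__":
    price, allocation = solve_budget([1.0, 2.0, 3.0], budget=14.0)
    print(price, allocation)
```

Each evaluation of total spending solves all $n$ one-dimensional subproblems independently, which is the point of the decomposition; only the single scalar $\mu$ couples them.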

Example 4. In Example 2 using duality with respect to the equality constraints we
denote the dual variables by $\lambda(k)$, $k = 1, 2, \ldots, N$. The $k$th subproblem becomes

\[ \max_{\substack{c \le y(k) \le d \\ 0 \le u(k)}} \{ f(y(k), u(k)) + [\lambda(k+1) - \lambda(k)]\,y(k) - \lambda(k)[u(k) - s(k)] \} \]

which is a two-dimensional optimization problem. Selection of $\boldsymbol{\lambda} \in E^N$
decomposes the problem into separate problems for each time period. The variable $\lambda(k)$
can be regarded as a value, measured in units of power, for water at the beginning
of period $k$. The $k$th subproblem can then be interpreted as that faced by an
entrepreneur who leased the dam for one period. He can buy water for the dam at the
beginning of the period at price $\lambda(k)$ and sell what he has left at the end of the period
at price $\lambda(k+1)$. His problem is to determine $y(k)$ and $u(k)$ so that his net profit,
accruing from sale of generated power and purchase and sale of water, is maximized.

Example 5 (The Hanging Chain). Consider again the problem of finding the
equilibrium position of the hanging chain considered in Example 4, Sect. 11.3, and
Example 1, Sect. 12.7. The problem is

\[
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{n} c_i y_i \\
\text{subject to} \quad & \sum_{i=1}^{n} y_i = 0 \\
& \sum_{i=1}^{n} \sqrt{1 - y_i^2} = L,
\end{aligned}
\]

where $c_i = n - i + \tfrac12$, $n = 20$, and $L = 16$. This problem is locally convex, since as shown in
Sect. 12.7 the Hessian of the Lagrangian is positive definite. The dual function is
accordingly

\[ \phi(\lambda, \mu) = \min \sum_{i=1}^{n} \left\{ c_i y_i + \lambda y_i + \mu\sqrt{1 - y_i^2} \right\} - L\mu. \]

Since the problem is separable, the minimization divides into a separate
minimization for each $y_i$, yielding the equations

\[ c_i + \lambda - \frac{\mu y_i}{\sqrt{1 - y_i^2}} = 0 \]

or

\[ (c_i + \lambda)^2 (1 - y_i^2) = \mu^2 y_i^2. \]

This yields

\[ y_i = \frac{-(c_i + \lambda)}{[(c_i + \lambda)^2 + \mu^2]^{1/2}}. \tag{14.30} \]

The above represents a local minimum point provided $\mu < 0$; and the minus sign
must be taken for consistency.

The dual function is then

\[ \phi(\lambda, \mu) = -L\mu + \sum_{i=1}^{n} \left\{ \frac{-(c_i + \lambda)^2}{[(c_i + \lambda)^2 + \mu^2]^{1/2}} + \mu \left[ \frac{\mu^2}{(c_i + \lambda)^2 + \mu^2} \right]^{1/2} \right\} \]

or finally, using $\sqrt{\mu^2} = -\mu$ for $\mu < 0$,

\[ \phi(\lambda, \mu) = -L\mu - \sum_{i=1}^{n} \sqrt{(c_i + \lambda)^2 + \mu^2}. \]

The correct values of $\lambda$ and $\mu$ can be found by maximizing $\phi(\lambda, \mu)$. One way to do
this is to use steepest ascent. The results of this calculation, starting at $\lambda = \mu = 0$,
are shown in Table 14.1. The values of $y_i$ can then be found from (14.30).

Table 14.1 Results of dual of chain problem

Iteration   Value         Final solution
                          $\lambda = -10.00048$
                          $\mu = -6.761136$
0           -200.00000    $y_1 = -0.8147154$
1           -66.94638     $y_2 = -0.7825940$
2           -66.61959     $y_3 = -0.7427243$
3           -66.55867     $y_4 = -0.6930215$
4           -66.54845     $y_5 = -0.6310140$
5           -66.54683     $y_6 = -0.5540263$
6           -66.54658     $y_7 = -0.4596696$
7           -66.54654     $y_8 = -0.3467526$
8           -66.54653     $y_9 = -0.2165239$
9           -66.54653     $y_{10} = -0.0736802$
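
The dual ascent for the chain problem can be sketched as follows. This is an illustration under stated assumptions, not the computation that produced Table 14.1: it takes $n = 20$ links (consistent with the starting value $-200$ in the table), $c_i = n - i + \tfrac12$, and $L = 16$, and it uses a simple backtracking step rule in place of an exact line search, so the iterates differ from the tabulated ones even though the limit should agree. All function names are hypothetical.

```python
import math

N_LINKS, LENGTH = 20, 16.0
C = [N_LINKS - i + 0.5 for i in range(1, N_LINKS + 1)]


def phi(lam, mu):
    """Dual function -L*mu - sum_i sqrt((c_i + lam)^2 + mu^2)."""
    return -LENGTH * mu - sum(math.hypot(c + lam, mu) for c in C)


def grad_phi(lam, mu):
    """Gradient of phi; by Lemma 1 these are the two constraint residuals."""
    d_lam = -sum((c + lam) / math.hypot(c + lam, mu) for c in C)
    d_mu = -LENGTH - sum(mu / math.hypot(c + lam, mu) for c in C)
    return d_lam, d_mu


def steepest_ascent(lam=0.0, mu=0.0, tol=1e-6, max_iter=2000):
    """Gradient ascent on phi with a backtracking (Armijo) step size."""
    for _ in range(max_iter):
        g_lam, g_mu = grad_phi(lam, mu)
        g2 = g_lam ** 2 + g_mu ** 2
        if g2 < tol ** 2:
            break
        step, base = 1.0, phi(lam, mu)
        while phi(lam + step * g_lam, mu + step * g_mu) < base + 1e-4 * step * g2:
            step *= 0.5
        lam, mu = lam + step * g_lam, mu + step * g_mu
    return lam, mu


def chain_positions(lam, mu):
    """Recover y_i from (14.30)."""
    return [-(c + lam) / math.hypot(c + lam, mu) for c in C]


if __name__ == "__main__":
    lam_opt, mu_opt = steepest_ascent()
    print(lam_opt, mu_opt, phi(lam_opt, mu_opt))
    print(chain_positions(lam_opt, mu_opt)[:10])
```

Because the gradient of the dual is just the pair of constraint residuals, each iteration costs only one pass over the links.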

14.5  Augmented  Lagrangian 

One of the most effective general classes of nonlinear programming methods is
the augmented Lagrangian methods, alternatively referred to as methods of
multipliers. These methods can be viewed as a combination of penalty functions and
local duality methods; the two concepts work together to eliminate many of the
disadvantages associated with either method alone. The augmented Lagrangian for the
equality constrained problem

\[
\begin{aligned}
\text{minimize} \quad & f(\mathbf{x}) \\
\text{subject to} \quad & \mathbf{h}(\mathbf{x}) = \mathbf{0}, \quad \mathbf{x} \in \Omega
\end{aligned}
\tag{14.31}
\]

is the function

\[ l_c(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \boldsymbol{\lambda}^T \mathbf{h}(\mathbf{x}) + \tfrac12 c\,|\mathbf{h}(\mathbf{x})|^2 \]

for some positive constant $c$. We shall briefly indicate how the augmented
Lagrangian can be viewed as either a special penalty function or as the basis for a dual
problem. These two viewpoints are then explored further in this and the next section.

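From a penalty function viewpoint the augmented Lagrangian, for a fixed value
of the vector $\boldsymbol{\lambda}$, is simply the standard quadratic penalty function for the problem

\[
\begin{aligned}
\text{minimize} \quad & f(\mathbf{x}) + \boldsymbol{\lambda}^T \mathbf{h}(\mathbf{x}) \\
\text{subject to} \quad & \mathbf{h}(\mathbf{x}) = \mathbf{0}, \quad \mathbf{x} \in \Omega.
\end{aligned}
\tag{14.32}
\]

This problem is clearly equivalent to the original problem (14.31), since
combinations of the constraints adjoined to $f(\mathbf{x})$ do not affect the minimum point or the
minimum value.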

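A typical step of an augmented Lagrangian method starts with a vector $\boldsymbol{\lambda}_k$. Then
$\mathbf{x}(\boldsymbol{\lambda}_k)$ is found as the minimum point of

\[
\begin{aligned}
\text{minimize} \quad & f(\mathbf{x}) + \boldsymbol{\lambda}_k^T \mathbf{h}(\mathbf{x}) + \tfrac12 c\,|\mathbf{h}(\mathbf{x})|^2 \\
\text{subject to} \quad & \mathbf{x} \in \Omega.
\end{aligned}
\tag{14.33}
\]

Next $\boldsymbol{\lambda}_k$ is updated to $\boldsymbol{\lambda}_{k+1}$. A standard method for the update is

\[ \boldsymbol{\lambda}_{k+1} = \boldsymbol{\lambda}_k + c\,\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}_k)). \]

To motivate the adjustment procedure, consider $\Omega = E^n$ and the constrained
problem (14.32) with $\boldsymbol{\lambda} = \boldsymbol{\lambda}_k$. The Lagrange multiplier corresponding to this
problem is $\boldsymbol{\lambda}^* - \boldsymbol{\lambda}_k$, where $\boldsymbol{\lambda}^*$ is the Lagrange multiplier of (14.31). On the other hand
since (14.33) is the penalty function corresponding to (14.32), it follows from the
results of Sect. 13.3 that $c\,\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}_k))$ is approximately equal to the Lagrange multiplier
of (14.32). Combining these two facts, we obtain $c\,\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}_k)) \approx \boldsymbol{\lambda}^* - \boldsymbol{\lambda}_k$. Therefore, a
good approximation to the unknown $\boldsymbol{\lambda}^*$ is $\boldsymbol{\lambda}_{k+1} = \boldsymbol{\lambda}_k + c\,\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}_k))$.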

Although  the  main  iteration  in  augmented  Lagrangian  methods  is  with  respect  to 
A,  the  penalty  parameter  c  may  also  be  adjusted  during  the  process.  As  in  ordinary 
penalty  function  methods,  the  sequence  of  c’s  is  usually  preselected;  c  is  either  held 
fixed,  is  increased  toward  a  finite  value,  or  tends  (slowly)  toward  infinity.  Since  in 
this method it is not necessary for $c$ to go to infinity, and in fact it may remain
of relatively modest value, the ill-conditioning usually associated with the penalty
function approach is mitigated.

From the viewpoint of duality theory, the augmented Lagrangian is simply the
standard Lagrangian for the problem

\[
\begin{aligned}
\text{minimize} \quad & f(\mathbf{x}) + \tfrac12 c\,|\mathbf{h}(\mathbf{x})|^2 \\
\text{subject to} \quad & \mathbf{h}(\mathbf{x}) = \mathbf{0}, \quad \mathbf{x} \in \Omega.
\end{aligned}
\tag{14.34}
\]

This problem is equivalent to the original problem (14.31), since the addition of
the term $\tfrac12 c\,|\mathbf{h}(\mathbf{x})|^2$ to the objective does not change the optimal value, the
optimum solution point, nor the Lagrange multiplier. However, whereas the original
Lagrangian may not be convex near the solution, and hence the standard duality
method cannot be applied, the term $\tfrac12 c\,|\mathbf{h}(\mathbf{x})|^2$ tends to "convexify" the Lagrangian.
For sufficiently large $c$, the Lagrangian will indeed be locally convex. Thus the
duality method can be employed, and the corresponding dual problem can be solved
by an iterative process in $\boldsymbol{\lambda}$. This viewpoint leads to the development of additional
multiplier adjustment processes.


The  Penalty  Viewpoint 

We begin our more detailed analysis of augmented Lagrangian methods by showing
that if the penalty parameter $c$ is sufficiently large, the augmented Lagrangian has a
local minimum point near the true optimal point. This follows from the following
simple lemma. (Again, we consider $\Omega = E^n$ for simplicity.)

Lemma. Let $\mathbf{A}$ and $\mathbf{B}$ be $n \times n$ symmetric matrices. Suppose that $\mathbf{B}$ is positive semi-definite
and that $\mathbf{A}$ is positive definite on the subspace $\mathbf{B}\mathbf{x} = \mathbf{0}$. Then there is a $c^*$ such that for all
$c \ge c^*$ the matrix $\mathbf{A} + c\mathbf{B}$ is positive definite.

Proof. Suppose to the contrary that for every $k$ there were an $\mathbf{x}_k$ with $|\mathbf{x}_k| = 1$ such
that $\mathbf{x}_k^T(\mathbf{A} + k\mathbf{B})\mathbf{x}_k \le 0$. The sequence $\{\mathbf{x}_k\}$ must have a convergent subsequence
converging to a limit $\bar{\mathbf{x}}$. Now since $\mathbf{x}_k^T \mathbf{B}\mathbf{x}_k \ge 0$, it follows that $\bar{\mathbf{x}}^T \mathbf{B}\bar{\mathbf{x}} = 0$. It also
follows that $\bar{\mathbf{x}}^T \mathbf{A}\bar{\mathbf{x}} \le 0$. However, this contradicts the hypothesis of the lemma. ∎
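
A two-dimensional instance makes the lemma concrete. The matrices `A` and `B` below are hypothetical data chosen for the illustration: `A` is indefinite, but it is positive definite on the null space of `B` (vectors of the form $(t, 0)$). Since $\mathbf{A} + c\mathbf{B}$ has determinant $c - 5$, the threshold for positive definiteness is $c = 5$ in this instance.

```python
def add_scaled(a, b, c):
    """Return the 2x2 matrix A + c*B."""
    return [[a[i][j] + c * b[i][j] for j in range(2)] for i in range(2)]


def is_positive_definite(m):
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    return m[0][0] > 0 and m[0][0] * m[1][1] - m[0][1] * m[1][0] > 0


A = [[1.0, 2.0], [2.0, -1.0]]   # indefinite, but x^T A x = t^2 > 0 when x = (t, 0)
B = [[0.0, 0.0], [0.0, 1.0]]    # positive semi-definite; Bx = 0 means x = (t, 0)

for c in (0.0, 4.0, 5.0, 6.0, 50.0):
    print(c, is_positive_definite(add_scaled(A, B, c)))
```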

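This lemma applies directly to the Hessian of the augmented Lagrangian
evaluated at the optimal solution pair $\mathbf{x}^*$, $\boldsymbol{\lambda}^*$. We assume as usual that the second-order
sufficiency conditions for a constrained minimum hold at $\mathbf{x}^*$, $\boldsymbol{\lambda}^*$. The Hessian of the
augmented Lagrangian evaluated at the optimal pair $\mathbf{x}^*$, $\boldsymbol{\lambda}^*$ is

\[
\begin{aligned}
\mathbf{L}_c(\mathbf{x}^*, \boldsymbol{\lambda}^*) &= \mathbf{F}(\mathbf{x}^*) + (\boldsymbol{\lambda}^*)^T \mathbf{H}(\mathbf{x}^*) + c\,\nabla\mathbf{h}(\mathbf{x}^*)^T \nabla\mathbf{h}(\mathbf{x}^*) \\
&= \mathbf{L}(\mathbf{x}^*) + c\,\nabla\mathbf{h}(\mathbf{x}^*)^T \nabla\mathbf{h}(\mathbf{x}^*).
\end{aligned}
\]

The first term, the Hessian of the normal Lagrangian, is positive definite on the
subspace $\nabla\mathbf{h}(\mathbf{x}^*)\mathbf{x} = \mathbf{0}$. This corresponds to the matrix $\mathbf{A}$ in the lemma. The matrix
$\nabla\mathbf{h}(\mathbf{x}^*)^T \nabla\mathbf{h}(\mathbf{x}^*)$ is positive semi-definite and corresponds to $\mathbf{B}$ in the lemma. It
follows that there is a $c^*$ such that for all $c > c^*$, $\mathbf{L}_c(\mathbf{x}^*, \boldsymbol{\lambda}^*)$ is positive definite.
This leads directly to the first basic result concerning augmented Lagrangians.

Proposition 1. Assume that the second-order sufficiency conditions for a local minimum
are satisfied at $\mathbf{x}^*$, $\boldsymbol{\lambda}^*$. Then there is a $c^*$ such that for all $c \ge c^*$, the augmented Lagrangian
$l_c(\mathbf{x}, \boldsymbol{\lambda}^*)$ has a local minimum point at $\mathbf{x}^*$.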

By a continuity argument the result of the above proposition can be extended to
a neighborhood around $\mathbf{x}^*$, $\boldsymbol{\lambda}^*$. That is, for any $\boldsymbol{\lambda}$ near $\boldsymbol{\lambda}^*$, the augmented Lagrangian
has a unique local minimum point near $\mathbf{x}^*$. This correspondence defines a continuous
function. If a value of $\boldsymbol{\lambda}$ can be found such that $\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda})) = \mathbf{0}$, then that $\boldsymbol{\lambda}$ must in
fact be $\boldsymbol{\lambda}^*$, since $\mathbf{x}(\boldsymbol{\lambda})$ satisfies the necessary conditions of the original problem.
Therefore, the problem of determining the proper value of $\boldsymbol{\lambda}$ can be viewed as one
of solving the equation $\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda})) = \mathbf{0}$. For this purpose the iterative process

\[ \boldsymbol{\lambda}_{k+1} = \boldsymbol{\lambda}_k + c\,\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}_k)) \]

is a method of successive approximation. This process will converge linearly in a
neighborhood around $\boldsymbol{\lambda}^*$, although a rigorous proof is somewhat complex. We shall
give more definite convergence results when we consider the duality viewpoint.

Example 1. Consider the simple quadratic problem studied in Sect. 13.8

\[
\begin{aligned}
\text{minimize} \quad & 2x^2 + 2xy + y^2 - 2y \\
\text{subject to} \quad & x = 0.
\end{aligned}
\]

The augmented Lagrangian for this problem is

\[ l_c(x, y, \lambda) = 2x^2 + 2xy + y^2 - 2y + \lambda x + \tfrac12 c x^2. \]

The minimum of this can be found analytically to be $x = -(2 + \lambda)/(2 + c)$, $y =
(4 + c + \lambda)/(2 + c)$. Since $h(x, y) = x$ in this example, it follows that the iterative
process for $\lambda_k$ is

\[ \lambda_{k+1} = \lambda_k - \frac{c(2 + \lambda_k)}{2 + c} \]

or

\[ \lambda_{k+1} = \frac{2}{2 + c}\,\lambda_k - \frac{2c}{2 + c}. \]

This converges to $\lambda = -2$ for any $c > 0$. The coefficient $2/(2 + c)$ governs the rate
of convergence, and clearly, as $c$ is increased the rate improves.
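
The multiplier iteration of this example is short enough to run directly. The sketch below uses the analytic minimizer of the augmented Lagrangian quoted in the example; the function names and the choice of starting multiplier are assumptions made for the illustration.

```python
def minimize_augmented(lam, c):
    """Analytic minimizer of 2x^2 + 2xy + y^2 - 2y + lam*x + (c/2)*x^2."""
    x = -(2.0 + lam) / (2.0 + c)
    y = (4.0 + c + lam) / (2.0 + c)
    return x, y


def method_of_multipliers(lam0, c, iters):
    """Iterate lam <- lam + c*h(x(lam)) with h(x, y) = x; return all multipliers."""
    lam, history = lam0, [lam0]
    for _ in range(iters):
        x, _y = minimize_augmented(lam, c)
        lam = lam + c * x
        history.append(lam)
    return history


if __name__ == "__main__":
    for c in (1.0, 2.0, 10.0):
        hist = method_of_multipliers(0.0, c, 30)
        ratio = (hist[2] + 2.0) / (hist[1] + 2.0)   # observed contraction factor
        print(c, hist[-1], ratio, 2.0 / (2.0 + c))
```

The observed contraction factor of the error $\lambda_k + 2$ matches $2/(2 + c)$, so larger $c$ gives faster convergence of the multiplier.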


Geometric  Interpretation 

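The augmented Lagrangian method can be interpreted geometrically in terms of the
primal function in a manner analogous to that in Sects. 13.3 and 13.8 for the ordinary
quadratic penalty function and the absolute-value penalty function. Consider again
the primal function $\omega(\mathbf{y})$ defined as

\[ \omega(\mathbf{y}) = \min\{f(\mathbf{x}) : \mathbf{h}(\mathbf{x}) = \mathbf{y}\}, \]

where the minimum is understood to be taken locally near $\mathbf{x}^*$. We remind the
reader that $\omega(\mathbf{0}) = f(\mathbf{x}^*)$ and that $\nabla\omega(\mathbf{0})^T = -\boldsymbol{\lambda}^*$. The minimum of the augmented
Lagrangian at step $k$ can be expressed in terms of the primal function as follows:

\[
\begin{aligned}
\min_{\mathbf{x}} l_c(\mathbf{x}, \boldsymbol{\lambda}_k) &= \min_{\mathbf{x}} \{f(\mathbf{x}) + \boldsymbol{\lambda}_k^T \mathbf{h}(\mathbf{x}) + \tfrac12 c\,|\mathbf{h}(\mathbf{x})|^2\} \\
&= \min_{\mathbf{x}, \mathbf{y}} \{f(\mathbf{x}) + \boldsymbol{\lambda}_k^T \mathbf{y} + \tfrac12 c\,|\mathbf{y}|^2 : \mathbf{h}(\mathbf{x}) = \mathbf{y}\} \\
&= \min_{\mathbf{y}} \{\omega(\mathbf{y}) + \boldsymbol{\lambda}_k^T \mathbf{y} + \tfrac12 c\,|\mathbf{y}|^2\},
\end{aligned}
\tag{14.35}
\]

where the minimization with respect to $\mathbf{y}$ is to be taken locally near $\mathbf{y} = \mathbf{0}$. This
minimization is illustrated geometrically for the case of a single constraint in Fig. 14.5.
The lower curve represents $\omega(y)$, and the upper curve represents $\omega(y) + \tfrac12 c\,|y|^2$. The
minimum point $y_k$ of (14.35) occurs at the point where this upper curve has slope
equal to $-\lambda_k$. It is seen that for $c$ sufficiently large this curve will be convex at $y = 0$.
If $\lambda_k$ is close to $\lambda^*$, it is clear that this minimum point will be close to 0; it will be
exact if $\lambda_k = \lambda^*$.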

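The process for updating $\boldsymbol{\lambda}_k$ is also illustrated in Fig. 14.5. Note that in general, if
$\mathbf{x}(\boldsymbol{\lambda}_k)$ minimizes $l_c(\mathbf{x}, \boldsymbol{\lambda}_k)$, then $\mathbf{y}_k = \mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}_k))$ is the minimum point of $\omega(\mathbf{y}) + \boldsymbol{\lambda}_k^T \mathbf{y} +
\tfrac12 c\,|\mathbf{y}|^2$. At that point we have as before

\[ \nabla\omega(\mathbf{y}_k)^T + c\,\mathbf{y}_k = -\boldsymbol{\lambda}_k \]

Fig. 14.5 Primal function and augmented Lagrangian

or equivalently,

\[ \nabla\omega(\mathbf{y}_k)^T = -(\boldsymbol{\lambda}_k + c\,\mathbf{y}_k) = -(\boldsymbol{\lambda}_k + c\,\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}_k))). \]

It follows that for the next multiplier we have

\[ \boldsymbol{\lambda}_{k+1} = \boldsymbol{\lambda}_k + c\,\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}_k)) = -\nabla\omega(\mathbf{y}_k)^T, \]

as shown in Fig. 14.5 for the one-dimensional case. In the figure the next point $y_{k+1}$
is the point where $\omega(y) + \tfrac12 c\,|y|^2$ has slope $-\lambda_{k+1}$, which will yield a positive value of
$y_{k+1}$ in this case. It can be seen that if $\lambda_k$ is sufficiently close to $\lambda^*$, then $\lambda_{k+1}$ will be
even closer, and the iterative process will converge.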


14.6  The  Method  of  Multipliers 

In the augmented Lagrangian method (the method of multipliers), the primary
iteration is with respect to $\boldsymbol{\lambda}$, and therefore it is most natural to consider the method
from the dual viewpoint. This is in fact the more powerful viewpoint and leads to
improvements in the algorithm.

As we observed earlier, the constrained problem

\[
\begin{aligned}
\text{minimize} \quad & f(\mathbf{x}) \\
\text{subject to} \quad & \mathbf{h}(\mathbf{x}) = \mathbf{0}, \quad \mathbf{x} \in \Omega
\end{aligned}
\tag{14.36}
\]

is equivalent to the problem

\[
\begin{aligned}
\text{minimize} \quad & f(\mathbf{x}) + \tfrac12 c\,|\mathbf{h}(\mathbf{x})|^2 \\
\text{subject to} \quad & \mathbf{h}(\mathbf{x}) = \mathbf{0}, \quad \mathbf{x} \in \Omega
\end{aligned}
\tag{14.37}
\]

in the sense that the solution points, the optimal values, and the Lagrange multipliers
are the same for both problems. However, as spelled out by Proposition 1 of the
previous section, whereas problem (14.36) may not be locally convex, problem (14.37)
is locally convex for sufficiently large $c$; specifically, the Hessian of the Lagrangian
is positive definite at the solution pair $\mathbf{x}^*$, $\boldsymbol{\lambda}^*$. Thus local duality theory is applicable
to problem (14.37) for sufficiently large $c$.

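To apply the dual method to (14.37), we define the dual function

\[ \phi(\boldsymbol{\lambda}) = \min\{f(\mathbf{x}) + \boldsymbol{\lambda}^T \mathbf{h}(\mathbf{x}) + \tfrac12 c\,|\mathbf{h}(\mathbf{x})|^2\} \tag{14.38} \]

in a region near $\mathbf{x}^*$, $\boldsymbol{\lambda}^*$. If $\mathbf{x}(\boldsymbol{\lambda})$ is the vector minimizing the right-hand side
of (14.38), then as we have seen in Sect. 14.2, $\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}))$ is the gradient of $\phi$. Thus
the iterative process

\[ \boldsymbol{\lambda}_{k+1} = \boldsymbol{\lambda}_k + c\,\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}_k)) \]

used in the basic augmented Lagrangian method is seen to be a steepest ascent
iteration for maximizing the dual function $\phi$. It is a simple form of steepest ascent,
using a constant stepsize $c$.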

Although  the  stepsize  c  is  a  good  choice  (as  will  become  even  more  evident 
later),  it  is  clearly  advantageous  to  apply  the  algorithmic  principles  of  optimization 
developed  previously  by  selecting  the  stepsize  so  that  the  new  value  of  the  dual 
function  satisfies  an  ascent  criterion.  This  can  extend  the  range  of  convergence  of 
the  algorithm. 

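The rate of convergence of the optimal steepest ascent method (where the stepsize
is selected to maximize $\phi$ in the gradient direction) is determined by the eigenvalues
of the Hessian of $\phi$. The Hessian of $\phi$ is found from (14.15) to be

\[ -\nabla\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}))\,[\mathbf{L}(\mathbf{x}(\boldsymbol{\lambda}), \boldsymbol{\lambda}) + c\,\nabla\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}))^T \nabla\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}))]^{-1}\,\nabla\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}))^T. \tag{14.39} \]

The eigenvalues of this matrix at the solution point $\mathbf{x}^*$, $\boldsymbol{\lambda}^*$ determine the convergence
rate of the method of steepest ascent.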

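To analyze the eigenvalues we make use of the matrix identity

\[ c\mathbf{B}(\mathbf{A} + c\mathbf{B}^T\mathbf{B})^{-1}\mathbf{B}^T = \mathbf{I} - (\mathbf{I} + c\mathbf{B}\mathbf{A}^{-1}\mathbf{B}^T)^{-1}, \]

which is a generalization of the Sherman-Morrison formula. (See Sect. 10.4.) It
is easily seen from the above identity that the matrices $\mathbf{B}(\mathbf{A} + c\mathbf{B}^T\mathbf{B})^{-1}\mathbf{B}^T$ and
$\mathbf{B}\mathbf{A}^{-1}\mathbf{B}^T$ have identical eigenvectors. One way to see this is to multiply both sides
of the identity by $(\mathbf{I} + c\mathbf{B}\mathbf{A}^{-1}\mathbf{B}^T)$ on the right to obtain

\[ c\mathbf{B}(\mathbf{A} + c\mathbf{B}^T\mathbf{B})^{-1}\mathbf{B}^T(\mathbf{I} + c\mathbf{B}\mathbf{A}^{-1}\mathbf{B}^T) = c\mathbf{B}\mathbf{A}^{-1}\mathbf{B}^T. \]

Suppose both sides are applied to an eigenvector $\mathbf{e}$ of $\mathbf{B}\mathbf{A}^{-1}\mathbf{B}^T$ having eigenvalue $w$.
Then we obtain

\[ c\mathbf{B}(\mathbf{A} + c\mathbf{B}^T\mathbf{B})^{-1}\mathbf{B}^T(1 + cw)\mathbf{e} = cw\,\mathbf{e}. \]

It follows that $\mathbf{e}$ is also an eigenvector of $\mathbf{B}(\mathbf{A} + c\mathbf{B}^T\mathbf{B})^{-1}\mathbf{B}^T$, and if $u$ is the
corresponding eigenvalue, the relation

\[ cu(1 + cw) = cw \]

must hold. Therefore, the eigenvalues are related by

\[ u = \frac{w}{1 + cw}. \tag{14.40} \]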


The above relations apply directly to the Hessian (14.39) through the associations
$\mathbf{A} = \mathbf{L}(\mathbf{x}^*, \boldsymbol{\lambda}^*)$ and $\mathbf{B} = \nabla\mathbf{h}(\mathbf{x}^*)$. Note that the matrix $\nabla\mathbf{h}(\mathbf{x}^*)\mathbf{L}(\mathbf{x}^*, \boldsymbol{\lambda}^*)^{-1}\nabla\mathbf{h}(\mathbf{x}^*)^T$,
corresponding to $\mathbf{B}\mathbf{A}^{-1}\mathbf{B}^T$ above, is the negative of the Hessian of the dual function of the original
problem (14.36). As shown in Sect. 14.3 the eigenvalues of this matrix determine the
rate of convergence for the ordinary dual method. Let $w$ and $W$ be the smallest and
largest eigenvalues of this matrix. From (14.40) it follows that the ratio of smallest
to largest eigenvalue magnitudes of the Hessian of the dual for the augmented problem is

\[ \frac{\dfrac{1}{W} + c}{\dfrac{1}{w} + c}. \]

This shows explicitly how the rate of convergence of the multiplier method depends
on $c$. As $c$ goes to infinity, the ratio of eigenvalues goes to unity, implying arbitrarily
fast convergence.
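
The dependence on $c$ can be tabulated directly from (14.40). In the sketch below the eigenvalues `w_min = 0.1` and `w_max = 10.0` are hypothetical values chosen for the illustration, and the steepest ascent rate is taken to be the canonical expression $((U - u)/(U + u))^2$ from Sect. 14.3 applied to the transformed eigenvalues.

```python
def augmented_eigenvalue(w, c):
    """Eigenvalue relation (14.40): u = w / (1 + c*w)."""
    return w / (1.0 + c * w)


def eigenvalue_ratio(w_min, w_max, c):
    """Ratio of smallest to largest eigenvalue, (1/W + c) / (1/w + c)."""
    return (1.0 / w_max + c) / (1.0 / w_min + c)


def steepest_ascent_rate(w_min, w_max, c):
    """Canonical rate ((U - u) / (U + u))^2 for the augmented dual problem."""
    u_min = augmented_eigenvalue(w_min, c)
    u_max = augmented_eigenvalue(w_max, c)
    return ((u_max - u_min) / (u_max + u_min)) ** 2


if __name__ == "__main__":
    w_min, w_max = 0.1, 10.0            # hypothetical eigenvalues of B A^{-1} B^T
    for c in (0.0, 1.0, 10.0, 100.0, 1000.0):
        print(c, eigenvalue_ratio(w_min, w_max, c), steepest_ascent_rate(w_min, w_max, c))
```

As $c$ grows the ratio approaches one and the rate factor approaches zero, in line with the statement that convergence becomes arbitrarily fast.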

Other unconstrained optimization techniques may be applied to the
maximization of the dual function defined by the augmented Lagrangian; conjugate gradient
methods, Newton's method, and quasi-Newton methods can all be used. The use
of Newton's method requires evaluation of the Hessian matrix (14.39). For some
problems this may be feasible, but for others some sort of approximation is
desirable. One approximation is obtained by noting that for large values of $c$, the
Hessian of the dual function is approximately equal to $-(1/c)\mathbf{I}$. Using this value for the
Hessian and $\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}))$ for the gradient in Newton's method, we are led to the iterative scheme

\[ \boldsymbol{\lambda}_{k+1} = \boldsymbol{\lambda}_k + c\,\mathbf{h}(\mathbf{x}(\boldsymbol{\lambda}_k)), \]

which is exactly the simple method of multipliers originally proposed.

We might summarize the above observations by the following statement relating
primal and dual convergence rates. If a penalty term is incorporated into a problem,
the condition number of the primal problem becomes increasingly poor as $c \to \infty$
but the condition number of the dual becomes increasingly good. To apply the dual
method, however, an unconstrained penalty problem of poor condition number must
be solved at each step.


Inequality  Constraints 

The advantage of augmented Lagrangian methods is mostly in dealing with equalities.
But certain inequality constraints can be easily incorporated. Let us consider
the problem with p inequality constraints:

minimize  $f(x)$

subject to  $g(x) \le 0$.  (14.41)

We assume that this problem has a well-defined solution $x^*$, which is a regular
point of the constraints and which satisfies the second-order sufficiency conditions
for a local minimum as specified in Sect. 11.8. This problem can be written as an
equivalent problem with equality constraints:

minimize  $f(x)$

subject to  $g(x) + u = 0$,  $u \ge 0$.  (14.42)

Through this conversion we can hope to simply apply the theory for equality
constraints to problems with inequalities.

In order to do so we must ensure that (14.42) satisfies the second-order sufficiency
conditions of Sect. 11.5. These conditions will not hold unless we impose a
strict complementarity assumption that $g_j(x^*) = 0$ implies $\mu_j^* > 0$ as well as the
usual second-order sufficiency conditions for the original problem (14.41). (See
Exercise 10.)

With these assumptions we define the (partial) dual function corresponding to the
augmented Lagrangian method as

$$\phi(\mu) = \min_{u \ge 0,\; x} \; f(x) + \mu^T[g(x) + u] + \tfrac{1}{2}c\,|g(x) + u|^2.$$  (14.43)

The minimization with respect to u in (14.43) can be carried out analytically, and
this will lead to a definition of the dual function that only involves minimization with
respect to x. The variable $u_j$ enters the objective of the dual function only through
the univariate quadratic expression

$$P_j = \mu_j[g_j(x) + u_j] + \tfrac{1}{2}c\,[g_j(x) + u_j]^2.$$  (14.44)

It is this expression that we must minimize with respect to $u_j \ge 0$. This is easily
accomplished by differentiation: if $u_j > 0$, the derivative must vanish; if $u_j = 0$, the
derivative must be nonnegative. The derivative is zero at $u_j = -g_j(x) - \mu_j/c$. Thus
we obtain the solution

$$u_j = \begin{cases} -g_j(x) - \mu_j/c & \text{if } -g_j(x) - \mu_j/c \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

or equivalently,

$$u_j = \max\{0,\; -g_j(x) - \mu_j/c\}.$$  (14.45)

We now substitute this into (14.44) in order to obtain an explicit expression for the
minimum of $P_j$.

For $u_j = 0$, we have

$$P_j = \frac{1}{2c}\left(2\mu_j c\,g_j(x) + c^2 g_j(x)^2\right) = \frac{1}{2c}\left([\mu_j + c\,g_j(x)]^2 - \mu_j^2\right).$$

For $u_j = -g_j(x) - \mu_j/c$ we have

$$P_j = -\mu_j^2/2c.$$

These can be combined into the formula

$$P_j = \frac{1}{2c}\left([\max\{0,\; \mu_j + c\,g_j(x)\}]^2 - \mu_j^2\right).$$

In view of the above, let us define the function of two scalar arguments t and $\mu$:

$$P_c(t, \mu) = \frac{1}{2c}\left([\max\{0,\; \mu + ct\}]^2 - \mu^2\right).$$  (14.46)

For a fixed $\mu > 0$, this function is shown in Fig. 14.6. Note that it is a smooth
function with derivative with respect to t equal to $\mu$ at t = 0.

The dual function for the inequality problem can now be written as

$$\phi(\mu) = \min_{x} \Big\{ f(x) + \sum_{j=1}^{p} P_c(g_j(x), \mu_j) \Big\}.$$  (14.47)

Thus inequality problems can be treated by adjoining to $f(x)$ a special penalty
function (that depends on $\mu$). The Lagrange multiplier $\mu$ can then be adjusted to
maximize $\phi$, just as in the case of equality constraints.


Fig. 14.6  Penalty function for inequality problem
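
The closed form (14.46) can be checked numerically against the minimization over $u \ge 0$ that defines it. The following is a small sketch under stated assumptions: the function names, the grid search, and the test values are illustrative choices, not part of the text.

```python
def penalty(t, mu, c):
    """P_c(t, mu) from (14.46)."""
    return (max(0.0, mu + c * t) ** 2 - mu ** 2) / (2.0 * c)


def penalty_by_search(t, mu, c, u_max=10.0, steps=100000):
    """Brute-force minimum over a grid of u in [0, u_max] of
    mu*(t + u) + 0.5*c*(t + u)^2, the expression in (14.44) with g_j(x) = t."""
    best = float("inf")
    for i in range(steps + 1):
        u = u_max * i / steps
        best = min(best, mu * (t + u) + 0.5 * c * (t + u) ** 2)
    return best


def penalty_slope(t, mu, c):
    """Derivative of P_c with respect to t, which is max(0, mu + c*t)."""
    return max(0.0, mu + c * t)
```

For example, with $\mu = 1$ and $c = 2$, `penalty(0.5, 1.0, 2.0)` falls in the branch $u_j = 0$, while `penalty(-2.0, 1.0, 2.0)` falls in the branch where the constraint is comfortably satisfied and the value is $-\mu^2/2c$. The slope at t = 0 equals $\mu$, as noted above.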


14.7  The  Alternating  Direction  Method  of  Multipliers 

Consider the convex minimization model with linear constraints and an objective
function which is the sum of two separable functions:

minimize  $f_1(x^1) + f_2(x^2)$

subject to  $A_1x^1 + A_2x^2 = b$,  (14.48)

$x^1 \in \Omega_1$,  $x^2 \in \Omega_2$,

where $A_i \in E^{m \times n_i}$ (i = 1, 2), $b \in E^m$, $\Omega_i \subset E^{n_i}$ (i = 1, 2) are closed convex sets,
and $f_i : E^{n_i} \to E$ (i = 1, 2) are convex functions on $\Omega_i$, respectively. Then, the
augmented Lagrangian function for (14.48) would be

$$l_c(x^1, x^2, \lambda) = f_1(x^1) + f_2(x^2) + \lambda^T(A_1x^1 + A_2x^2 - b) + \frac{c}{2}|A_1x^1 + A_2x^2 - b|^2.$$

Throughout this section, we assume problem (14.48) has at least one optimal solution.

In contrast to the method of multipliers in the last section, the alternating direction
method of multipliers (ADMM) (approximately) minimizes $l_c(x^1, x^2, \lambda)$ in
an alternating order:

$$\begin{aligned}
x^1_{k+1} &= \arg\min_{x^1 \in \Omega_1} l_c(x^1, x^2_k, \lambda_k),\\
x^2_{k+1} &= \arg\min_{x^2 \in \Omega_2} l_c(x^1_{k+1}, x^2, \lambda_k),\\
\lambda_{k+1} &= \lambda_k + c(A_1x^1_{k+1} + A_2x^2_{k+1} - b).
\end{aligned}$$  (14.49)
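
The iteration (14.49) can be sketched on a small example. The problem below is a hypothetical scalar instance chosen so that both block minimizations have closed forms; it is an illustration under those assumptions, not an example from the text.

```python
def admm_toy(c=1.0, iters=100):
    """ADMM (14.49) on the hypothetical scalar problem

        minimize 0.5*(x1 - 1)^2 + 0.5*(x2 - 2)^2   subject to   x1 + x2 = 1,

    that is, A1 = A2 = 1 and b = 1. Its solution is x1 = 0, x2 = 1, lambda = 1.
    """
    x1, x2, lam = 0.0, 0.0, 0.0
    for _ in range(iters):
        # x1-step: stationarity (x1 - 1) + lam + c*(x1 + x2 - 1) = 0 with x2 fixed.
        x1 = (1.0 - lam - c * (x2 - 1.0)) / (1.0 + c)
        # x2-step: same condition for x2, using the freshly updated x1.
        x2 = (2.0 - lam - c * (x1 - 1.0)) / (1.0 + c)
        # Multiplier step: move lam along the constraint residual.
        lam = lam + c * (x1 + x2 - 1.0)
    return x1, x2, lam


if __name__ == "__main__":
    print(admm_toy())
```

Each block update only touches its own variable, which is the point of the splitting: neither step requires a joint minimization over $(x^1, x^2)$.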


The idea is that each of the smaller minimization problems can be solved more
efficiently, or even in closed form in certain cases.


Convergence Speed Analysis

We present a convergence speed analysis of the ADMM. For simplicity, we shall let
$\Omega_i$ be $E^{n_i}$ and $f_i$ be differentiable convex functions [the result is also valid for the
ADMM applied to the aforementioned more general problem (14.48)]. Then, any
optimal solution and multiplier $(x^1_*, x^2_*, \lambda_*)$ satisfy

$$\nabla f_1(x^1_*)^T + A_1^T\lambda_* = 0, \quad \nabla f_2(x^2_*)^T + A_2^T\lambda_* = 0, \quad A_1x^1_* + A_2x^2_* - b = 0,$$  (14.50)

and these conditions are also sufficient.

We first establish a key lemma.

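Lemma 1. Let $d^i_k = x^i_k - x^i_*$, i = 1, 2, and $d^\lambda_k = \lambda_k - \lambda_*$, and let $\{x^1_k, x^2_k, \lambda_k\}$ be the sequence
generated by ADMM (14.49). Then, it holds that

$$c|A_2d^2_{k+1}|^2 + \frac{1}{c}|d^\lambda_{k+1}|^2 \le c|A_2d^2_k|^2 + \frac{1}{c}|d^\lambda_k|^2 - \left(c|A_2(x^2_{k+1} - x^2_k)|^2 + \frac{1}{c}|\lambda_{k+1} - \lambda_k|^2\right).$$

Proof. From the first-order optimality conditions of (14.49), we have

$$\begin{cases}
\nabla f_1(x^1_{k+1})^T + A_1^T[\lambda_k + c(A_1x^1_{k+1} + A_2x^2_k - b)] = 0,\\
\nabla f_2(x^2_{k+1})^T + A_2^T[\lambda_k + c(A_1x^1_{k+1} + A_2x^2_{k+1} - b)] = 0,\\
\lambda_{k+1} = \lambda_k + c(A_1x^1_{k+1} + A_2x^2_{k+1} - b).
\end{cases}$$  (14.51)

Substituting the last equation into the other equations in (14.51), we obtain

$$\begin{cases}
\nabla f_1(x^1_{k+1})^T + A_1^T\lambda_{k+1} = -cA_1^TA_2(x^2_k - x^2_{k+1}),\\
\nabla f_2(x^2_{k+1})^T + A_2^T\lambda_{k+1} = 0,\\
A_1x^1_{k+1} + A_2x^2_{k+1} - b = \frac{1}{c}(\lambda_{k+1} - \lambda_k).
\end{cases}$$  (14.52)

Moreover, the convexity of $f_i$, i = 1, 2, implies

$$(\nabla f_1(x^1_{k+1}) - \nabla f_1(x^1_*))(x^1_{k+1} - x^1_*) \ge 0 \quad\text{and}\quad (\nabla f_2(x^2_{k+1}) - \nabla f_2(x^2_*))(x^2_{k+1} - x^2_*) \ge 0.$$

On the other hand, from (14.50) and (14.52),

$$\nabla f_1(x^1_{k+1})^T - \nabla f_1(x^1_*)^T = \nabla f_1(x^1_{k+1})^T + A_1^T\lambda_* = -A_1^Td^\lambda_{k+1} - cA_1^TA_2(x^2_k - x^2_{k+1}),$$

$$\nabla f_2(x^2_{k+1})^T - \nabla f_2(x^2_*)^T = \nabla f_2(x^2_{k+1})^T + A_2^T\lambda_* = -A_2^Td^\lambda_{k+1},$$

and

$$0 = A_1x^1_{k+1} + A_2x^2_{k+1} - b - \frac{1}{c}(\lambda_{k+1} - \lambda_k) = A_1d^1_{k+1} + A_2d^2_{k+1} + \frac{1}{c}(\lambda_k - \lambda_{k+1}).$$

Thus,

$$\begin{aligned}
0 &\le \begin{pmatrix} d^1_{k+1}\\ d^2_{k+1}\\ d^\lambda_{k+1} \end{pmatrix}^T
\begin{pmatrix} \nabla f_1(x^1_{k+1})^T - \nabla f_1(x^1_*)^T\\ \nabla f_2(x^2_{k+1})^T - \nabla f_2(x^2_*)^T\\ 0 \end{pmatrix}\\
&= \begin{pmatrix} d^1_{k+1}\\ d^2_{k+1}\\ d^\lambda_{k+1} \end{pmatrix}^T
\begin{pmatrix} -A_1^Td^\lambda_{k+1} - cA_1^TA_2(x^2_k - x^2_{k+1})\\ -A_2^Td^\lambda_{k+1}\\ A_1d^1_{k+1} + A_2d^2_{k+1} + \frac{1}{c}(\lambda_k - \lambda_{k+1}) \end{pmatrix}\\
&= \begin{pmatrix} d^1_{k+1}\\ d^2_{k+1}\\ d^\lambda_{k+1} \end{pmatrix}^T
\begin{pmatrix} -cA_1^TA_2(x^2_k - x^2_{k+1})\\ 0\\ \frac{1}{c}(\lambda_k - \lambda_{k+1}) \end{pmatrix}.
\end{aligned}$$  (14.53)

Again from $-A_1d^1_{k+1} = \frac{1}{c}(\lambda_k - \lambda_{k+1}) + A_2d^2_{k+1}$, inequality (14.53) implies

$$\begin{aligned}
0 &\le \begin{pmatrix} \frac{1}{c}(\lambda_k - \lambda_{k+1}) + A_2d^2_{k+1}\\ d^\lambda_{k+1} \end{pmatrix}^T
\begin{pmatrix} cA_2(x^2_k - x^2_{k+1})\\ \frac{1}{c}(\lambda_k - \lambda_{k+1}) \end{pmatrix}\\
&= \begin{pmatrix} A_2d^2_{k+1}\\ d^\lambda_{k+1} \end{pmatrix}^T
\begin{pmatrix} cA_2(x^2_k - x^2_{k+1})\\ \frac{1}{c}(\lambda_k - \lambda_{k+1}) \end{pmatrix}
+ (\lambda_k - \lambda_{k+1})^TA_2(x^2_k - x^2_{k+1}).
\end{aligned}$$

Since $\nabla f_2(x^2_k) = -\lambda_k^TA_2$ holds for every $k \ge 0$, it follows from the convexity of $f_2$
that

$$(\lambda_k - \lambda_{k+1})^TA_2(x^2_k - x^2_{k+1}) = -(\nabla f_2(x^2_k) - \nabla f_2(x^2_{k+1}))(x^2_k - x^2_{k+1}) \le 0.$$

Thus,

$$\begin{pmatrix} \sqrt{c}\,A_2d^2_{k+1}\\ \frac{1}{\sqrt{c}}d^\lambda_{k+1} \end{pmatrix}^T
\begin{pmatrix} \sqrt{c}\,A_2(x^2_{k+1} - x^2_k)\\ \frac{1}{\sqrt{c}}(\lambda_{k+1} - \lambda_k) \end{pmatrix} \le 0.$$

Representing the left vector by u and the right one by v in the last inequality, we
have

$$0 \ge u^Tv = \tfrac{1}{2}\left(|u|^2 + |v|^2 - |u - v|^2\right).$$

Noting

$$u - v = \begin{pmatrix} \sqrt{c}\,A_2d^2_{k+1}\\ \frac{1}{\sqrt{c}}d^\lambda_{k+1} \end{pmatrix}
- \begin{pmatrix} \sqrt{c}\,A_2(x^2_{k+1} - x^2_k)\\ \frac{1}{\sqrt{c}}(\lambda_{k+1} - \lambda_k) \end{pmatrix}
= \begin{pmatrix} \sqrt{c}\,A_2d^2_k\\ \frac{1}{\sqrt{c}}d^\lambda_k \end{pmatrix},$$

we obtain the desired result in Lemma 1. ∎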


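For simplicity, let c = 1 in the following. Taking the sum from iterate 0 to iterate
k for the inequality in Lemma 1, we obtain

$$\sum_{t=0}^{k}\left(|A_2(x^2_{t+1} - x^2_t)|^2 + |\lambda_{t+1} - \lambda_t|^2\right) \le |A_2x^2_0 - A_2x^2_*|^2 + |\lambda_0 - \lambda_*|^2.$$

Thus, we have

$$\min_{0 \le t \le k}\left(|A_2(x^2_{t+1} - x^2_t)|^2 + |\lambda_{t+1} - \lambda_t|^2\right) \le \frac{1}{k+1}\left(|A_2(x^2_0 - x^2_*)|^2 + |\lambda_0 - \lambda_*|^2\right).$$

Therefore, from (14.52) we have

Theorem 1. After k iterations of the ADMM method, there must be at least one iterate
$0 \le \bar{k} \le k$ such that

$$\left|\begin{pmatrix} \nabla f_1(x^1_{\bar{k}+1})^T + A_1^T\lambda_{\bar{k}+1}\\ \nabla f_2(x^2_{\bar{k}+1})^T + A_2^T\lambda_{\bar{k}+1}\\ A_1x^1_{\bar{k}+1} + A_2x^2_{\bar{k}+1} - b \end{pmatrix}\right|^2
\le \frac{1 + |A_1|^2}{k+1}\left(|A_2(x^2_0 - x^2_*)|^2 + |\lambda_0 - \lambda_*|^2\right),$$

that is, $(x^1_{\bar{k}+1}, x^2_{\bar{k}+1}, \lambda_{\bar{k}+1})$ has its optimality condition error square bounded by the quantity
on the right-hand side, which converges to 0 arithmetically as $k \to \infty$.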

The  Three  Block  Extension 


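It is natural to consider the ADMM method for solving problems with more than
two blocks:

minimize  $f_1(x^1) + f_2(x^2) + f_3(x^3)$

subject to  $A_1x^1 + A_2x^2 + A_3x^3 = b$,  (14.54)

$x^1 \in \Omega_1$,  $x^2 \in \Omega_2$,  $x^3 \in \Omega_3$,

where $A_i \in E^{m \times n_i}$ (i = 1, 2, 3), $b \in E^m$, $\Omega_i \subset E^{n_i}$ (i = 1, 2, 3) are closed convex sets,
and $f_i : E^{n_i} \to E$ (i = 1, 2, 3) are convex functions on $\Omega_i$, respectively. With the
same philosophy as the ADMM to take advantage of the separable structure, one
could consider the procedure

$$\begin{aligned}
x^1_{k+1} &:= \arg\min_{x^1 \in \Omega_1} l_c(x^1, x^2_k, x^3_k, \lambda_k),\\
x^2_{k+1} &:= \arg\min_{x^2 \in \Omega_2} l_c(x^1_{k+1}, x^2, x^3_k, \lambda_k),\\
x^3_{k+1} &:= \arg\min_{x^3 \in \Omega_3} l_c(x^1_{k+1}, x^2_{k+1}, x^3, \lambda_k),\\
\lambda_{k+1} &:= \lambda_k + c(A_1x^1_{k+1} + A_2x^2_{k+1} + A_3x^3_{k+1} - b),
\end{aligned}$$

where the augmented Lagrangian function

$$l_c(x^1, x^2, x^3, \lambda) = \sum_{i=1}^{3} f_i(x^i) + \lambda^T\Big(\sum_{i=1}^{3} A_ix^i - b\Big) + \frac{c}{2}\Big|\sum_{i=1}^{3} A_ix^i - b\Big|^2.$$  (14.55)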


Unfortunately, unlike the convergence property for solving two-block problems,
such a direct extension of ADMM need not converge for problems with three blocks.
Indeed, consider the following linear homogeneous equation with three variables:

$$(A_1, A_2, A_3)\begin{pmatrix} x^1\\ x^2\\ x^3 \end{pmatrix}
= \begin{pmatrix} 1 & 1 & 1\\ 1 & 1 & 2\\ 1 & 2 & 2 \end{pmatrix}\begin{pmatrix} x^1\\ x^2\\ x^3 \end{pmatrix} = 0.$$

Let c = 1 and let each block contain one variable. Then, simple calculation will show
that the direct extension of ADMM (14.55) is divergent from any point in a subspace
of $E^3$. Note that the convergence of ADMM (14.55) applied to solving the linear
equations with a null objective is independent of the selection of the penalty
parameter c. We conclude:

Theorem 2. For the three-block convex minimization problem (14.54), the direct extension
of ADMM (14.55) may not converge for any penalty parameter c > 0 starting from any
point in a subspace.
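
The "simple calculation" can be imitated numerically. The sketch below is an illustration under stated assumptions: it takes the matrix columns $A_1 = (1,1,1)^T$, $A_2 = (1,1,2)^T$, $A_3 = (1,2,2)^T$, a null objective, b = 0, c = 1, and an arbitrarily chosen starting point. It runs the direct three-block sweep and returns the norm of the iterate so that its growth can be observed.

```python
import math

# Columns of the 3x3 matrix of the example; each block is one scalar variable.
A1 = (1.0, 1.0, 1.0)
A2 = (1.0, 1.0, 2.0)
A3 = (1.0, 2.0, 2.0)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sweep(x2, x3, lam, c=1.0):
    """One pass of the direct three-block extension with f_i = 0 and b = 0.

    Each block minimizes lam^T r + (c/2)|r|^2 over its own scalar variable,
    where r = A1*x1 + A2*x2 + A3*x3, then the multiplier is updated.
    """
    x1 = -(dot(A1, lam) / c + dot(A1, A2) * x2 + dot(A1, A3) * x3) / dot(A1, A1)
    x2 = -(dot(A2, lam) / c + dot(A2, A1) * x1 + dot(A2, A3) * x3) / dot(A2, A2)
    x3 = -(dot(A3, lam) / c + dot(A3, A1) * x1 + dot(A3, A2) * x2) / dot(A3, A3)
    r = [A1[i] * x1 + A2[i] * x2 + A3[i] * x3 for i in range(3)]
    lam = [lam[i] + c * r[i] for i in range(3)]
    return x1, x2, x3, lam


def iterate_norm(iters, x2=1.0, x3=-0.5, lam=(0.3, -0.2, 0.1)):
    """Norm of (x1, x2, x3, lam) after the given number of sweeps."""
    lam = list(lam)
    x1 = 0.0
    for _ in range(iters):
        x1, x2, x3, lam = sweep(x2, x3, lam)
    return math.sqrt(x1 * x1 + x2 * x2 + x3 * x3 + dot(lam, lam))


if __name__ == "__main__":
    for n in (10, 100, 1000, 3000):
        print(n, iterate_norm(n))
```

Since the only solution of the homogeneous system is the origin, a convergent method would drive this norm to zero; growth of the printed values is the divergence referred to in Theorem 2.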


14.8  *  Cutting  Plane  Methods

Cutting plane methods are applied to problems having the general form

minimize  $c^Tx$

subject to  $x \in S$,  (14.56)

where $S \subset E^n$ is a closed convex set. Problems that involve minimization of a
convex function over a convex set, such as the problem

minimize  $f(y)$

subject to  $y \in R$,  (14.57)

where $R \subset E^{n-1}$ is a convex set and f is a convex function, can be easily converted
to the form (14.56) by writing (14.57) equivalently as

minimize  $r$

subject to  $f(y) - r \le 0$,  $y \in R$  (14.58)

which, with $x = (r, y) \in E^n$, is a special case of (14.56).


General  Form  of  Algorithm 

The general form of a cutting-plane algorithm for problem (14.56) is as follows:
Given a polytope $P_k \supset S$,

Step 1. Minimize $c^Tx$ over $P_k$, obtaining a point $x_k$ in $P_k$. If $x_k \in S$, stop; $x_k$ is
optimal. Otherwise,

Step 2. Find a hyperplane $H_k$ separating the point $x_k$ from S, that is, find $a_k \in E^n$,
$b_k \in E^1$ such that $S \subset \{x : a_k^Tx \le b_k\}$, $x_k \in \{x : a_k^Tx > b_k\}$. Update $P_k$ to
obtain $P_{k+1}$ including as a constraint $a_k^Tx \le b_k$.

The process is illustrated in Fig. 14.7.

Specific algorithms differ mainly in the manner in which the hyperplane that
separates the current point $x_k$ from the constraint set S is selected. This selection is,
of course, the most important aspect of the algorithm, since it is the deepness of the
cut associated with the separating hyperplane, the distance of the hyperplane from
the current point, that governs how much improvement there is in the approximation
to the constraint set, and hence how fast the method converges.


Fig.  14.7  Cutting  plane  method 


Specific algorithms also differ somewhat with respect to the manner by which
the polytope is updated once the new hyperplane is determined. The most
straightforward procedure is to simply adjoin the linear inequality associated with that
hyperplane to the ones determined previously. This yields the best possible updated
approximation to the constraint set but tends to produce, after a large number of
iterations, an unwieldy number of inequalities expressing the approximation. Thus,
in some algorithms, older inequalities that are not binding at the current point are
discarded from further consideration.

The general cutting plane algorithm can be regarded as an extended application
of duality in linear programming, and although this viewpoint does not particularly
aid in the analysis of the method, it reveals the basic interconnection between cutting
plane and dual methods. The foundation of this viewpoint is the fact that S can be
written as the intersection of all the half-spaces that contain it; thus

$$S = \{x : a_i^Tx \le b_i,\; i \in I\},$$

where I is an (infinite) index set corresponding to all half-spaces containing S. With
S viewed in this way problem (14.56) can be thought of as an (infinite) linear
programming problem.

Corresponding to this linear program there is (at least formally) the dual problem

maximize  $-\sum_{i \in I} \lambda_ib_i$

subject to  $\sum_{i \in I} \lambda_ia_i = -c$,  (14.59)

$\lambda_i \ge 0$,  $i \in I$.

Selecting a finite subset of I, say $\bar{I}$, and forming

$$P = \{x : a_i^Tx \le b_i,\; i \in \bar{I}\}$$

gives a polytope that contains S. Minimizing $c^Tx$ over this polytope yields a point
and a corresponding subset of active constraints $I_A$. The dual problem with the
additional restriction $\lambda_i = 0$ for $i \notin I_A$ will then have a feasible solution, but this solution
will in general not be optimal. Thus, a solution to a polytope problem corresponds
to a feasible but non-optimal solution to the dual. For this reason the cutting plane
method can be regarded as working toward optimality of the (infinite dimensional)
dual.


Kelley’s  Convex  Cutting  Plane  Algorithm 

The convex cutting plane method was developed to solve convex programming
problems of the form

minimize  $f(x)$  (14.60)

subject to  $g_i(x) \le 0$,  i = 1, 2, ..., p,

where $x \in E^n$ and f and the $g_i$'s are differentiable convex functions. As indicated in
the last section, it is sufficient to consider the case where the objective function is
linear; thus, we consider the problem

minimize  $c^Tx$  (14.61)

subject to  $g(x) \le 0$

where $x \in E^n$ and $g(x) \in E^p$ is convex and differentiable.

For g convex and differentiable we have the fundamental inequality

$$g(x) \ge g(w) + \nabla g(w)(x - w)$$  (14.62)

for any x, w. We use this inequality to determine the separating hyperplane.
Specifically, the algorithm is as follows:

Let $S = \{x : g(x) \le 0\}$ and let P be an initial polytope containing S and such that
$c^Tx$ is bounded on P. Then

Step 1. Minimize $c^Tx$ over P, obtaining the point x = w. If $g(w) \le 0$, stop; w is an
optimal solution. Otherwise,

Step 2. Let i be an index maximizing $g_i(w)$. Clearly $g_i(w) > 0$. Define the new
approximating polytope to be the old one intersected with the half-space

$$\{x : g_i(w) + \nabla g_i(w)(x - w) \le 0\}.$$  (14.63)

Return to Step 1.

The set defined by (14.63) is actually a half-space if $\nabla g_i(w) \ne 0$. However,
$\nabla g_i(w) = 0$ would imply that w minimizes $g_i$, which is impossible if S is nonempty.
Furthermore, the half-space given by (14.63) contains S, since if $g(x) \le 0$ then
by (14.62) $g_i(w) + \nabla g_i(w)(x - w) \le g_i(x) \le 0$. The half-space does not contain
the point w since $g_i(w) > 0$. This method for selecting the separating hyperplane is
illustrated in Fig. 14.8 for the one-dimensional case. Note that in one dimension, the
procedure reduces to Newton's method.
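
The one-dimensional remark can be made concrete with a short sketch. It assumes a hypothetical instance, minimize $-x$ subject to $g(x) = x^2 - 2 \le 0$, with an interval as the initial polytope; the function name, the tolerance and the assumption that $g'(w) > 0$ at every infeasible right endpoint are illustrative choices rather than part of the algorithm statement.

```python
import math


def kelley_1d(g, dg, lo, hi, tol=1e-12, max_iter=100):
    """Convex cutting plane method for: minimize -x subject to g(x) <= 0.

    In one dimension the approximating polytope is an interval [lo, hi], so the
    linear program of Step 1 is solved by taking its right endpoint.
    Assumes dg(w) > 0 whenever the right endpoint w is infeasible.
    """
    for _ in range(max_iter):
        w = hi                      # Step 1: minimizer of -x over [lo, hi]
        if g(w) <= tol:
            return w                # feasible to tolerance, hence optimal
        # Step 2: adjoin the cut g(w) + dg(w)*(x - w) <= 0, which here reads
        # x <= w - g(w)/dg(w). The new endpoint is exactly a Newton step on g.
        hi = w - g(w) / dg(w)
    return hi


if __name__ == "__main__":
    root = kelley_1d(lambda x: x * x - 2.0, lambda x: 2.0 * x, 0.0, 10.0)
    print(root, math.sqrt(2.0))
```

Each cut replaces the right endpoint by the zero of the linearization of g at the current point, which is why the sequence of endpoints coincides with Newton's iteration for g(x) = 0.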


Fig.  14.8  Convex  cutting  plane 


Calculation of the separating hyperplane is exceedingly simple in this algorithm,
and hence the method really amounts to the solution of a series of linear
programming problems. It should be noted that this algorithm, valid for any convex
programming problem, does not involve any line searches. In that respect it is also similar to
Newton's method applied to a convex function.


Convergence 

Under fairly mild assumptions on the convex function, the convex cutting plane
method is globally convergent. It is possible to apply the general convergence
theorem to prove this, but somewhat easier, in this case, to prove it directly.

Theorem. Let the convex functions $g_i$, i = 1, 2, ..., p be continuously differentiable, and
suppose the convex cutting plane algorithm generates the sequence of points $\{w_k\}$. Any limit
point of this sequence is a solution to problem (14.61).

Proof. Suppose $\{w_k\}$, $k \in K$, is a subsequence of $\{w_k\}$ converging to w. By taking
a further subsequence of this, if necessary, we may assume that the index i
corresponding to Step 2 of the algorithm is fixed throughout the subsequence. Now if
$k \in K$, $k' \in K$ and $k' > k$, then we must have

$$g_i(w_k) + \nabla g_i(w_k)(w_{k'} - w_k) \le 0,$$

which implies that

$$g_i(w_k) \le |\nabla g_i(w_k)|\,|w_{k'} - w_k|.$$  (14.64)

Since $|\nabla g_i(w_k)|$ is bounded with respect to $k \in K$, the right-hand side of (14.64) goes
to zero as k and k' go to infinity. The left-hand side goes to $g_i(w)$. Thus $g_i(w) \le 0$.
Since i maximizes $g_j(w_k)$ over j for each k in the subsequence, every other constraint
satisfies $g_j(w) \le g_i(w) \le 0$ as well, and we see that w is feasible for problem (14.61).

If $f^*$ is the optimal value of problem (14.61), we have $c^Tw_k \le f^*$ for each k since
$w_k$ is obtained by minimizing over a set containing S. Thus, by continuity, $c^Tw \le f^*$
and hence w is an optimal solution. ∎

As with most algorithms based on linear programming concepts, the rate of
convergence of cutting plane algorithms has not yet been satisfactorily analyzed.
Preliminary research shows that these algorithms converge arithmetically, that is, if $x^*$
is optimal, then $|x_k - x^*|^2 \le c/k$ for some constant c. This is an exceedingly poor
type of convergence. This estimate, however, may not be the best possible and
indeed there are indications that the convergence is actually geometric but with a ratio
that goes to unity as the dimension of the problem increases.


Modifications 

We now describe the supporting hyperplane algorithm (an alternative method for
determining a cutting plane) and examine the possibility of dropping from
consideration some old hyperplanes so that the linear programs do not grow too large.
The convexity requirements are less severe for this algorithm. It is applicable to
problems of the form

minimize  $c^Tx$
subject to  $g(x) \le 0$,

where $x \in E^n$, $g(x) \in E^p$, the $g_i$'s are continuously differentiable, and the constraint
region S defined by the inequalities is convex. Note that convexity of the functions
themselves is not required. We also assume the existence of a point interior to the
constraint region, that is, we assume the existence of a point y such that $g(y) < 0$,
and we assume that on the constraint boundary $g_i(x) = 0$ implies $\nabla g_i(x) \ne 0$. The
algorithm is as follows:

Start with an initial polytope P containing S and such that $c^Tx$ is bounded below
on S. Then

Step 1. Determine w = x to minimize $c^Tx$ over P. If $w \in S$, stop. Otherwise,

Step 2. Find the point u on the line joining y and w that lies on the boundary
of S. Let i be an index for which $g_i(u) = 0$ and define the half-space
$H = \{x : \nabla g_i(u)(x - u) \le 0\}$. Update P by intersecting with H. Return to Step 1.

The  algorithm  is  illustrated  in  Fig.  14.9. 

The  price  paid  for  the  generality  of  this  method  over  the  convex  cutting  plane 
method  is  that  an  interpolation  along  the  line  joining  y  and  w  must  be  executed  to 
find  the  point  u.  This  is  analogous  to  the  line  search  for  a  minimum  point  required 
by  most  programming  algorithms. 
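
The interpolation in Step 2 can be sketched with a simple bisection. This is a minimal illustration under stated assumptions: the function names, the tolerance, and the choice of bisection (rather than any particular interpolation rule) are not from the text, and the routine relies on S being convex so that the segment from y to w crosses the boundary exactly once.

```python
def boundary_point(g_list, y, w, tol=1e-10):
    """Bisection on the segment from y (interior: all g_i(y) < 0) to w (outside S).

    Returns a point u with max_i g_i(u) approximately zero, together with the
    index i of a constraint that is active there.
    """
    def point(t):
        return [yi + t * (wi - yi) for yi, wi in zip(y, w)]

    def worst(t):
        vals = [g(point(t)) for g in g_list]
        i = max(range(len(vals)), key=lambda j: vals[j])
        return vals[i], i

    lo, hi = 0.0, 1.0            # worst(lo) < 0 < worst(hi) by assumption
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if worst(mid)[0] <= 0.0:
            lo = mid
        else:
            hi = mid
    return point(lo), worst(lo)[1]


def supporting_halfspace(grad_i, u):
    """Coefficients (a, b) of the cut a^T x <= b with a = grad g_i(u), b = a^T u."""
    a = grad_i(u)
    return a, sum(ai * ui for ai, ui in zip(a, u))


if __name__ == "__main__":
    disk = lambda x: x[0] ** 2 + x[1] ** 2 - 1.0      # S is the unit disk
    u, i = boundary_point([disk], [0.0, 0.0], [2.0, 0.0])
    print(u, i, supporting_halfspace(lambda x: [2 * x[0], 2 * x[1]], u))
```

For the unit disk with y at the origin and w = (2, 0), the boundary point is approximately (1, 0) and the resulting cut is approximately $2x_1 \le 2$, the tangent line to the disk at that point.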


Fig.  14.9  Supporting  hyperplane  algorithm 


Dropping  Nonbinding  Constraints 

In all cutting plane algorithms nonbinding constraints can be dropped from the
approximating set of linear inequalities so as to keep the complexity of the
approximation manageable. Indeed, since n linearly independent hyperplanes determine
a single point in $E^n$, the algorithm can be arranged, by discarding the nonbinding
constraints at the end of each step, so that the polytope consists of exactly n linear
inequalities at every stage.
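
A minimal sketch of the discarding step follows. The representation of a cut as a pair (a, b) meaning $a^Tx \le b$, the function name, and the tolerance are assumptions made for the illustration.

```python
def drop_nonbinding(cuts, x, tol=1e-9):
    """Keep only the cuts (a, b), meaning a^T x <= b, that are binding at x."""
    kept = []
    for a, b in cuts:
        slack = b - sum(ai * xi for ai, xi in zip(a, x))
        if slack <= tol:          # binding: the constraint holds with equality
            kept.append((a, b))
    return kept


if __name__ == "__main__":
    cuts = [((1.0, 0.0), 1.0), ((0.0, 1.0), 1.0), ((1.0, 1.0), 5.0)]
    print(drop_nonbinding(cuts, (1.0, 1.0)))
```

In this example the point (1, 1) lies on the first two hyperplanes, so those two cuts are retained and the third, which has positive slack, is discarded.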

Global convergence is not destroyed by this process, since the sequence of
objective values will still be monotonically increasing. It is not known, however, what
effect this has on the speed of convergence.


14.9  Exercises

1.  (Linear programming) Use the global duality theorem to find the dual of the
linear program

minimize  $c^Tx$
subject to  $Ax = b$,  $x \ge 0$.

Note that some of the regularity conditions may not be necessary for the linear
case.

2.  (Double dual) Show that for a convex programming problem with a solution,
the dual of the dual is in some sense the original problem.

3.  (Non-convex?) Consider the problem

minimize  $xy$
subject to  $x + y - 4 \ge 0$

$1 \le x \le 5$,  $1 \le y \le 5$.

Show that although the objective function is not convex, the primal function is
convex. Find the optimal value and the Lagrange multiplier.

4.  Find  the  global  maximum  of  the  dual  function  of  Example  1,  Sect.  14.2. 

5.  Show that the function $\phi$ defined for $\lambda$, $\mu$ ($\mu \ge 0$) by $\phi(\lambda, \mu) = \min_x[f(x) +
\lambda^Th(x) + \mu^Tg(x)]$ is concave over any convex region where it is finite.

6.  Prove  that  the  dual  canonical  rate  of  convergence  is  not  affected  by  a  change  of 
variables  in  x. 

7.  Corresponding  to  the  dual  function  (14.23): 

(a)  Find  its  gradient. 

(b)  Find  its  Hessian. 

(c)  Verify that it has a local maximum at $\lambda^*$, $\mu^*$.

8.  Find  the  Hessian  of  the  dual  function  for  a  separable  problem. 

9.  Find an explicit formula for the dual function for the entropy problem
(Example 3, Sect. 11.4).

10.  Consider the problems

minimize  $f(x)$  (14.65)

subject to  $g_j(x) \le 0$,  j = 1, 2, ..., p


and


minimize  $f(x)$  (14.66)

subject to  $g_j(x) + z_j^2 = 0$,  j = 1, 2, ..., p.

(a)  Let $x^*, \mu_1^*, \mu_2^*, \ldots, \mu_p^*$ be a point and set of Lagrange multipliers that
satisfy the first-order necessary conditions for (14.65). For $x^*$, $\mu^*$, write the
second-order sufficiency conditions for (14.66).

(b)  Show that in general they are not satisfied unless, in addition to satisfying
the sufficiency conditions of Sect. 11.8, $g_j(x^*) = 0$ implies $\mu_j^* > 0$.

11.  Establish global convergence for the supporting hyperplane algorithm.

12.  Establish global convergence for an imperfect version of the supporting
hyperplane algorithm that in interpolating to find the boundary point u actually finds
a point somewhere on the segment joining u and $\tfrac{1}{2}u + \tfrac{1}{2}w$ and establishes a
hyperplane there.

13.  Prove that the convex cutting plane method is still globally convergent if it is
modified by discarding from the definition of the polytope at each stage
hyperplanes corresponding to inactive linear inequalities.


Chapter  15

Primal-Dual  Methods


This chapter discusses methods that work simultaneously with primal and dual
variables, in essence seeking to satisfy the first-order necessary conditions for
optimality. The methods employ many of the concepts used in earlier chapters, including
those related to active set methods, various first and second order methods, penalty
methods, and barrier methods. Indeed, a study of this chapter is in a sense a review
and extension of what has been presented earlier.

The first several sections of the chapter discuss methods for solving the standard
nonlinear programming structure that has been treated in Parts II and III of the
text. These sections provide alternatives to the methods discussed earlier.


15.1  The  Standard  Problem 

Consider again the standard nonlinear program

minimize  $f(x)$  (15.1)

subject to  $h(x) = 0$,  $g(x) \le 0$.

Together with the feasibility, the first-order necessary conditions for optimality are,
as we know,

$\nabla f(x) + \lambda^T\nabla h(x) + \mu^T\nabla g(x) = 0$  (15.2)

$\mu \ge 0$

$\mu^Tg(x) = 0$.

The last requirement is the complementary slackness condition. If it is known which
of the inequality constraints are active at the solution, these active constraints can be
rolled into the equality constraints h(x) = 0, and the inactive inequalities along with
the complementary slackness condition dropped, to obtain a problem with equality
constraints only. This indeed is the structure of the problem near the solution.

If in this structure the vector x is n-dimensional and h is m-dimensional, then $\lambda$
will also be m-dimensional. The system (15.1) will, in this reduced form, consist of
n + m equations and n + m unknowns, which is an indication that the system may
be well defined, and hence that there is a solution for the pair $(x, \lambda)$. In essence,
primal-dual methods amount to solving this system of equations, and use additional
strategies to account for inequality constraints.

In view of the above observation it is natural to consider whether the
system of necessary conditions is in fact well conditioned, possessing a unique
solution $(x, \lambda)$. We investigate this question by considering a linearized version
of the conditions.

A useful and somewhat more general approach is to consider the
quadratic program


minimize  $\tfrac{1}{2}x^TQx + c^Tx$

subject to  $Ax = b$,  (15.3)


where x is n-dimensional and b is m-dimensional.

The first-order conditions for this problem are

$Qx + A^T\lambda + c = 0$  (15.4)

$Ax - b = 0$.


These  correspond  to  the  necessary  conditions  (15.2)  for  equality  constraints  only. 
The  following  proposition  gives  conditions  under  which  the  system  is  nonsingular. 


Proposition. Let Q and A be n × n and m × n matrices, respectively. Suppose that A has
rank m and that Q is positive definite on the subspace $M = \{x : Ax = 0\}$. Then the matrix

$$\begin{bmatrix} Q & A^T\\ A & 0 \end{bmatrix}$$  (15.5)

is nonsingular.


Proof. Suppose $(x, y) \in E^{n+m}$ is such that

$Qx + A^Ty = 0$

$Ax = 0$.  (15.6)

Multiplication of the first equation by $x^T$ yields

$$x^TQx + x^TA^Ty = 0,$$

and substitution of Ax = 0 yields $x^TQx = 0$. However, clearly $x \in M$, and thus the
hypothesis on Q together with $x^TQx = 0$ implies that x = 0. It then follows from the
first equation that $A^Ty = 0$. The full-rank condition on A then implies that y = 0.
Thus the only solution to (15.6) is x = 0, y = 0. ∎


If, as is often the case, the matrix Q is actually positive definite (over the whole
space), then an explicit formula for the solution of the system can be easily derived
as follows: From the first equation in (15.4) we have

$$x = -Q^{-1}A^T\lambda - Q^{-1}c.$$

Substitution of this into the second equation then yields

$$-AQ^{-1}A^T\lambda - AQ^{-1}c - b = 0,$$

from which we immediately obtain

$$\lambda = -(AQ^{-1}A^T)^{-1}[AQ^{-1}c + b]$$  (15.7)

and

$$\begin{aligned}
x &= Q^{-1}A^T(AQ^{-1}A^T)^{-1}[AQ^{-1}c + b] - Q^{-1}c\\
  &= -Q^{-1}[I - A^T(AQ^{-1}A^T)^{-1}AQ^{-1}]c + Q^{-1}A^T(AQ^{-1}A^T)^{-1}b.
\end{aligned}$$  (15.8)
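
As a small numerical check of the system (15.4), the sketch below solves it directly as one linear system with the matrix (15.5), using a plain Gaussian elimination written for the purpose. The helper names and the two test problems are assumptions made for illustration; any linear solver could be substituted.

```python
def solve_linear(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


def solve_equality_qp(Q, A, c, b):
    """Solve Qx + A^T lam + c = 0, Ax - b = 0 as a single system, cf. (15.4), (15.5)."""
    n, m = len(Q), len(A)
    # Assemble the block matrix [[Q, A^T], [A, 0]].
    K = [list(Q[i]) + [A[j][i] for j in range(m)] for i in range(n)]
    K += [list(A[j]) + [0.0] * m for j in range(m)]
    z = solve_linear(K, [-ci for ci in c] + list(b))
    return z[:n], z[n:]


if __name__ == "__main__":
    # Q = I: the result can be compared with (15.7), lam = -(A A^T)^{-1} b = -1/2.
    print(solve_equality_qp([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]], [0.0, 0.0], [1.0]))
    # Q singular but positive definite on {x : Ax = 0}; the system is still solvable.
    print(solve_equality_qp([[0.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]], [1.0, 1.0], [2.0]))
```

The second call illustrates the proposition: Q itself is singular, so the explicit formulas (15.7) and (15.8) do not apply, yet the block matrix is nonsingular because Q is positive definite on the null space of A.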


Strategies 

There  are  some  general  strategies  that  guide  the  development  of  the  primal-dual 
methods  of  this  chapter. 

1.  Descent  Measures.  A  fundamental  concept  that  we  have  frequently  used  is 
that  of  assuring  that  progress  is  made  at  each  step  of  an  iterative  algorithm. 
It  is  this  that  is  used  to  guarantee  global  convergence.  In  primal  methods  this 
measure  of  descent  is  the  objective  function.  Even  the  simplex  method  of  linear 
programming  is  founded  on  this  idea  of  making  progress  with  respect  to  the 
objective  function.  For  primal  minimization  methods,  one  typically  arranges 
that  the  objective  function  decreases  at  each  step. 

The objective function is not the only possible way to measure progress. We
have, for example, when minimizing a function $f$, considered the quantity
$(1/2)|\nabla f(\mathbf{x})|^2$, seeking to monotonically reduce it to zero.

In general, a function used to measure progress is termed a merit function.
Typically, it is defined so as to decrease as progress is made toward the solution
of a minimization problem, but the sign may be reversed in some definitions.
For primal-dual methods, the merit function may depend on both $\mathbf{x}$ and $\boldsymbol{\lambda}$. One
especially useful merit function for equality constrained problems is

$$m(\mathbf{x}, \boldsymbol{\lambda}) = \tfrac{1}{2}|\nabla f(\mathbf{x}) + \boldsymbol{\lambda}^T\nabla\mathbf{h}(\mathbf{x})|^2 + \tfrac{1}{2}|\mathbf{h}(\mathbf{x})|^2.$$

It is examined in the next section. We shall examine other merit functions
later in the chapter. With interior point methods or semidefinite programming,
we shall use a potential function that serves as a merit function.

2.  Active  Set  Methods.  Inequality  constraints  can  be  treated  using  active  set 
methods  that  treat  the  active  constraints  as  equality  constraints,  at  least  for  the 
current iteration. However, in primal-dual methods, both $\mathbf{x}$ and $\boldsymbol{\lambda}$ are changed.
We shall consider variations of steepest descent, conjugate directions, and
Newton’s method where movement is made in the $(\mathbf{x}, \boldsymbol{\lambda})$ space.

3.  Penalty  Functions.  In  some  primal-dual  methods,  a  penalty  function  can  serve 
as  a  merit  function,  even  though  the  penalty  function  depends  only  on  x.  This 
is  particularly  attractive  for  recursive  quadratic  programming  methods  where  a 
quadratic  program  is  solved  at  each  stage  to  determine  the  direction  of  change 
in the pair $(\mathbf{x}, \boldsymbol{\lambda})$.

4.  Interior  (Barrier)  Methods.  Barrier  methods  lead  to  methods  that  move  within 
the  relative  interior  of  the  inequality  constraints.  This  approach  leads  to  the 
concept  of  the  primal-dual  central  path.  These  methods  are  used  for  semidefinite 
programming  since  these  problems  are  characterized  as  possessing  a  special 
form  of  inequality  constraint. 


15.2  A  Simple  Merit  Function 

It  is  very  natural,  when  considering  the  system  of  necessary  conditions  (15.2),  to 
form  the  function 


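$$m(\mathbf{x}, \boldsymbol{\lambda}) = \tfrac{1}{2}|\nabla f(\mathbf{x}) + \boldsymbol{\lambda}^T\nabla\mathbf{h}(\mathbf{x})|^2 + \tfrac{1}{2}|\mathbf{h}(\mathbf{x})|^2, \tag{15.9}$$

and use it as a measure of how close a point $(\mathbf{x}, \boldsymbol{\lambda})$ is to a solution.

It must be noted, however, that the function $m(\mathbf{x}, \boldsymbol{\lambda})$ is not always well-behaved;
it may have local minima, and these are of no value in a search for a solution. The
following theorem gives the conditions under which the function $m(\mathbf{x}, \boldsymbol{\lambda})$ can serve
as a well-behaved merit function. Basically, the main requirement is that the Hessian
of the Lagrangian be positive definite. As usual, we define $l(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \boldsymbol{\lambda}^T\mathbf{h}(\mathbf{x})$.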

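Theorem. Let $f$ and $\mathbf{h}$ be twice continuously differentiable functions on $E^n$ of dimension 1
and $m$, respectively. Suppose that $\mathbf{x}^*$ and $\boldsymbol{\lambda}^*$ satisfy the first-order necessary conditions for a
local minimum of $m(\mathbf{x}, \boldsymbol{\lambda}) = \tfrac{1}{2}|\nabla f(\mathbf{x}) + \boldsymbol{\lambda}^T\nabla\mathbf{h}(\mathbf{x})|^2 + \tfrac{1}{2}|\mathbf{h}(\mathbf{x})|^2$ with respect to $\mathbf{x}$ and $\boldsymbol{\lambda}$. Suppose
also that at $\mathbf{x}^*$, $\boldsymbol{\lambda}^*$, (i) the rank of $\nabla\mathbf{h}(\mathbf{x}^*)$ is $m$ and (ii) the Hessian matrix
$\mathbf{L}(\mathbf{x}^*, \boldsymbol{\lambda}^*) = \mathbf{F}(\mathbf{x}^*) + \boldsymbol{\lambda}^{*T}\mathbf{H}(\mathbf{x}^*)$ is positive definite. Then $\mathbf{x}^*$, $\boldsymbol{\lambda}^*$ is a (possibly nonunique) global minimum
point of $m(\mathbf{x}, \boldsymbol{\lambda})$, with value $m(\mathbf{x}^*, \boldsymbol{\lambda}^*) = 0$.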

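Proof. Since $\mathbf{x}^*$, $\boldsymbol{\lambda}^*$ satisfies the first-order conditions for a local minimum point of
$m(\mathbf{x}, \boldsymbol{\lambda})$, we have

$$[\nabla f(\mathbf{x}^*) + \boldsymbol{\lambda}^{*T}\nabla\mathbf{h}(\mathbf{x}^*)]\mathbf{L}(\mathbf{x}^*, \boldsymbol{\lambda}^*) + \mathbf{h}(\mathbf{x}^*)^T\nabla\mathbf{h}(\mathbf{x}^*) = \mathbf{0} \tag{15.10}$$

$$[\nabla f(\mathbf{x}^*) + \boldsymbol{\lambda}^{*T}\nabla\mathbf{h}(\mathbf{x}^*)]\nabla\mathbf{h}(\mathbf{x}^*)^T = \mathbf{0}. \tag{15.11}$$

Multiplying (15.10) on the right by $[\nabla f(\mathbf{x}^*) + \boldsymbol{\lambda}^{*T}\nabla\mathbf{h}(\mathbf{x}^*)]^T$ and using (15.11) we
obtain$^1$

$$\nabla l(\mathbf{x}^*, \boldsymbol{\lambda}^*)\mathbf{L}(\mathbf{x}^*, \boldsymbol{\lambda}^*)\nabla l(\mathbf{x}^*, \boldsymbol{\lambda}^*)^T = 0.$$

Since $\mathbf{L}(\mathbf{x}^*, \boldsymbol{\lambda}^*)$ is positive definite, this implies that $\nabla l(\mathbf{x}^*, \boldsymbol{\lambda}^*) = \mathbf{0}$. Using this
in (15.10), we find that $\mathbf{h}(\mathbf{x}^*)^T\nabla\mathbf{h}(\mathbf{x}^*) = \mathbf{0}$, which, since $\nabla\mathbf{h}(\mathbf{x}^*)$ is of rank $m$, implies
that $\mathbf{h}(\mathbf{x}^*) = \mathbf{0}$. ∎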

The requirement that the Hessian of the Lagrangian $\mathbf{L}(\mathbf{x}^*, \boldsymbol{\lambda}^*)$ be positive definite
at a stationary point of the merit function $m$ is actually not too restrictive. This
condition will be satisfied in the case of a convex programming problem where $f$ is
strictly convex and $\mathbf{h}$ is linear. Furthermore, even in nonconvex problems one can
often arrange for this condition to hold, at least near a solution to the original
constrained minimization problem. If it is assumed that the second-order sufficiency
conditions for a constrained minimum hold at $\mathbf{x}^*$, $\boldsymbol{\lambda}^*$, then $\mathbf{L}(\mathbf{x}^*, \boldsymbol{\lambda}^*)$ is positive
definite on the subspace that defines the tangent to the constraints; that is, on the
subspace defined by $\nabla\mathbf{h}(\mathbf{x}^*)\mathbf{x} = \mathbf{0}$. Now if the original problem is modified with a
penalty term to the problem

$$\begin{aligned}
\text{minimize} \quad & f(\mathbf{x}) + \tfrac{1}{2}c|\mathbf{h}(\mathbf{x})|^2 \\
\text{subject to} \quad & \mathbf{h}(\mathbf{x}) = \mathbf{0},
\end{aligned} \tag{15.12}$$

the solution point $\mathbf{x}^*$ will be unchanged. However, as discussed in Chap. 14, the
Hessian of the Lagrangian of this new problem (15.12) at the solution point is
$\mathbf{L}(\mathbf{x}^*, \boldsymbol{\lambda}^*) + c\nabla\mathbf{h}(\mathbf{x}^*)^T\nabla\mathbf{h}(\mathbf{x}^*)$. For sufficiently large $c$, this matrix will be positive
definite. Thus a problem can be “convexified” (at least locally) before the merit
function method is employed.
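The convexification step can be illustrated with a small numerical sketch. The matrices below are hypothetical: L is indefinite on the whole space but positive definite on the tangent subspace {x : Ax = 0}, and A plays the role of the constraint gradient at the solution.

```python
import numpy as np

# Hypothetical data: L is indefinite, but x^T L x > 0 on the subspace {x : A x = 0}.
L = np.array([[1.0, 0.0],
              [0.0, -1.0]])
A = np.array([[0.0, 1.0]])      # tangent subspace is the x1-axis

def min_eig_convexified(c):
    """Smallest eigenvalue of L + c A^T A."""
    return np.linalg.eigvalsh(L + c * A.T @ A).min()

z = np.array([1.0, 0.0])        # a tangent direction, A z = 0
curvature_on_tangent = z @ L @ z

for c in (0.0, 0.5, 2.0, 10.0):
    print(c, min_eig_convexified(c))
```

In this example L + cA^T A = diag(1, c - 1), so the modified Hessian becomes positive definite exactly when c exceeds 1. The threshold depends on the data and is not known in advance in general.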

An extension to problems with inequality constraints can be defined by partitioning
the constraints into the two groups active and inactive. However, at this point
the simple merit function for problems with equality constraints is adequate for the
purpose of illustrating the general idea.


15.3  Basic  Primal-Dual  Methods 

Many primal-dual methods are patterned after some of the methods used in earlier
chapters, except of course that the emphasis is on equation solving rather than
explicit optimization.


$^1$ Unless explicitly indicated to the contrary, the notation $\nabla l(\mathbf{x}, \boldsymbol{\lambda})$ refers to the gradient of $l$ with
respect to $\mathbf{x}$, that is, $\nabla_{\mathbf{x}} l(\mathbf{x}, \boldsymbol{\lambda})$.


First-Order Method

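We consider first a simple straightforward approach, which in a sense parallels the
idea of steepest descent in that it uses only a first-order approximation to the
primal-dual equations. It is defined by

$$\begin{aligned}
\mathbf{x}_{k+1} &= \mathbf{x}_k - \alpha_k\nabla l(\mathbf{x}_k, \boldsymbol{\lambda}_k)^T \\
\boldsymbol{\lambda}_{k+1} &= \boldsymbol{\lambda}_k + \alpha_k\mathbf{h}(\mathbf{x}_k),
\end{aligned} \tag{15.13}$$

where $\alpha_k$ is not yet determined. This is based on the error in satisfying (15.2).
Assume that the Hessian of the Lagrangian $\mathbf{L}(\mathbf{x}, \boldsymbol{\lambda})$ is positive definite in some compact
region of interest, and consider the simple merit function

$$m(\mathbf{x}, \boldsymbol{\lambda}) = \tfrac{1}{2}|\nabla l(\mathbf{x}, \boldsymbol{\lambda})|^2 + \tfrac{1}{2}|\mathbf{h}(\mathbf{x})|^2 \tag{15.14}$$

discussed above. We would like to determine whether the direction of change
in (15.13) is a descent direction with respect to this merit function. The gradient
of the merit function has components corresponding to $\mathbf{x}$ and $\boldsymbol{\lambda}$ of

$$\begin{aligned}
& \nabla l(\mathbf{x}, \boldsymbol{\lambda})\mathbf{L}(\mathbf{x}, \boldsymbol{\lambda}) + \mathbf{h}(\mathbf{x})^T\nabla\mathbf{h}(\mathbf{x}) \\
& \nabla l(\mathbf{x}, \boldsymbol{\lambda})\nabla\mathbf{h}(\mathbf{x})^T.
\end{aligned} \tag{15.15}$$

Thus the inner product of this gradient with the direction vector having components
$-\nabla l(\mathbf{x}, \boldsymbol{\lambda})^T$, $\mathbf{h}(\mathbf{x})$ is

$$\begin{aligned}
& -\nabla l(\mathbf{x}, \boldsymbol{\lambda})\mathbf{L}(\mathbf{x}, \boldsymbol{\lambda})\nabla l(\mathbf{x}, \boldsymbol{\lambda})^T - \mathbf{h}(\mathbf{x})^T\nabla\mathbf{h}(\mathbf{x})\nabla l(\mathbf{x}, \boldsymbol{\lambda})^T + \nabla l(\mathbf{x}, \boldsymbol{\lambda})\nabla\mathbf{h}(\mathbf{x})^T\mathbf{h}(\mathbf{x}) \\
& \qquad = -\nabla l(\mathbf{x}, \boldsymbol{\lambda})\mathbf{L}(\mathbf{x}, \boldsymbol{\lambda})\nabla l(\mathbf{x}, \boldsymbol{\lambda})^T \le 0.
\end{aligned}$$

This shows that the search direction is in fact a descent direction for the merit
function, unless $\nabla l(\mathbf{x}, \boldsymbol{\lambda}) = \mathbf{0}$. Thus by selecting $\alpha_k$ to minimize the merit function
in the search direction at each step, the process will converge to a point where
$\nabla l(\mathbf{x}, \boldsymbol{\lambda}) = \mathbf{0}$. However, there is no guarantee that $\mathbf{h}(\mathbf{x}) = \mathbf{0}$ at that point.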

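We can try to improve the method either by changing the way in which the direction
is selected or by changing the merit function. In this case a slight modification
of the merit function will work. Let

$$w(\mathbf{x}, \boldsymbol{\lambda}, \gamma) = m(\mathbf{x}, \boldsymbol{\lambda}) - \gamma[f(\mathbf{x}) + \boldsymbol{\lambda}^T\mathbf{h}(\mathbf{x})]$$

for some $\gamma > 0$. We then calculate that the gradient of $w$ has the two components
corresponding to $\mathbf{x}$ and $\boldsymbol{\lambda}$

$$\begin{aligned}
& \nabla l(\mathbf{x}, \boldsymbol{\lambda})\mathbf{L}(\mathbf{x}, \boldsymbol{\lambda}) + \mathbf{h}(\mathbf{x})^T\nabla\mathbf{h}(\mathbf{x}) - \gamma\nabla l(\mathbf{x}, \boldsymbol{\lambda}) \\
& \nabla l(\mathbf{x}, \boldsymbol{\lambda})\nabla\mathbf{h}(\mathbf{x})^T - \gamma\mathbf{h}(\mathbf{x})^T,
\end{aligned}$$

and hence the inner product of the gradient with the direction $-\nabla l(\mathbf{x}, \boldsymbol{\lambda})^T$, $\mathbf{h}(\mathbf{x})$ is

$$-\nabla l(\mathbf{x}, \boldsymbol{\lambda})[\mathbf{L}(\mathbf{x}, \boldsymbol{\lambda}) - \gamma\mathbf{I}]\nabla l(\mathbf{x}, \boldsymbol{\lambda})^T - \gamma|\mathbf{h}(\mathbf{x})|^2.$$

Now since we are assuming that $\mathbf{L}(\mathbf{x}, \boldsymbol{\lambda})$ is positive definite in a compact region of
interest, there is a $\gamma > 0$ such that $\mathbf{L}(\mathbf{x}, \boldsymbol{\lambda}) - \gamma\mathbf{I}$ is positive definite in this region.
Then according to the above calculation, the direction $-\nabla l(\mathbf{x}, \boldsymbol{\lambda})^T$, $\mathbf{h}(\mathbf{x})$ is a descent
direction, and the standard descent method will converge to a solution. This method
will not converge very rapidly however. (See Exercise 2 for further analysis of this
method.)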
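The first-order method with the modified merit function w can be sketched on a small example. This is an illustration under stated assumptions, not a general implementation: the test problem (minimize x1^2 + 2 x2^2 subject to x1 + x2 = 1), the value gamma = 1, and all names are hypothetical. Because this problem is quadratic, w is quadratic as well, so the step that minimizes w along the direction can be written in closed form; a general problem would need a one-dimensional search instead.

```python
import numpy as np

# Hypothetical test problem: minimize x1^2 + 2*x2^2 subject to x1 + x2 - 1 = 0.
Q = np.diag([2.0, 4.0])          # Hessian of the Lagrangian (constant here)
A = np.array([[1.0, 1.0]])       # gradient of h
b = np.array([1.0])
n, m = 2, 1
gamma = 1.0                      # chosen so that Q - gamma*I is positive definite

K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
H_w = K @ K - gamma * K          # Hessian of w for this quadratic problem

def first_order_method(z, tol=1e-10, max_iter=50000):
    z = np.asarray(z, dtype=float).copy()
    for _ in range(max_iter):
        x, lam = z[:n], z[n:]
        grad_l = Q @ x + A.T @ lam           # gradient of the Lagrangian in x
        h = A @ x - b
        r = np.concatenate([grad_l, h])      # residual of the Lagrange equations
        if np.linalg.norm(r) < tol:
            break
        d = np.concatenate([-grad_l, h])     # direction of (15.13)
        grad_w = K @ r - gamma * r           # gradient of w = m - gamma*l
        slope = grad_w @ d                   # negative by the calculation in the text
        alpha = -slope / (d @ H_w @ d)       # exact minimizer of w along d
        z = z + alpha * d
    return z[:n], z[n:]

x_sol, lam_sol = first_order_method(np.zeros(n + m))
print(x_sol, lam_sol)
```

For this problem the Lagrange equations give x = (2/3, 1/3) and multiplier -4/3, which the iteration should approach.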


Conjugate  Directions 


One may also use conjugate direction methods. Let us consider the quadratic program

$$\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} - \mathbf{b}^T\mathbf{x} \\
\text{subject to} \quad & \mathbf{A}\mathbf{x} = \mathbf{c}.
\end{aligned} \tag{15.16}$$

The first-order necessary conditions for this problem are

$$\begin{aligned}
\mathbf{Q}\mathbf{x} + \mathbf{A}^T\boldsymbol{\lambda} &= \mathbf{b} \\
\mathbf{A}\mathbf{x} &= \mathbf{c}.
\end{aligned} \tag{15.17}$$

As discussed in the previous section, this problem is equivalent to solving a system
of linear equations whose coefficient matrix is

$$\mathbf{M} = \begin{bmatrix} \mathbf{Q} & \mathbf{A}^T \\ \mathbf{A} & \mathbf{0} \end{bmatrix}. \tag{15.18}$$

This matrix is symmetric, but it is not positive definite (nor even semidefinite).
However, it is possible to formally generalize the conjugate gradient method to systems
of this type by just applying the conjugate-gradient formulae (9.17)–(9.20) of
Sect. 9.3 with $\mathbf{Q}$ replaced by $\mathbf{M}$. A difficulty is that singular directions (defined as
directions $\mathbf{p}$ such that $\mathbf{p}^T\mathbf{M}\mathbf{p} = 0$) may occur and cause the process to break down.
Procedures for overcoming this difficulty have been developed, however. Also, as
in the ordinary conjugate gradient method, the approach can be generalized to treat
nonquadratic problems as well. Overall, however, the application of conjugate direction
methods to the Lagrange system of equations, although very promising, is
not currently considered practical.
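The two obstacles named above, indefiniteness of M and the existence of singular directions, are easy to exhibit numerically. The sketch below uses hypothetical data with Q positive definite and a single constraint.

```python
import numpy as np

# Hypothetical data: Q positive definite, one linear constraint.
Q = np.diag([2.0, 4.0])
A = np.array([[1.0, 1.0]])
n, m = Q.shape[0], A.shape[0]

M = np.block([[Q, A.T], [A, np.zeros((m, m))]])   # matrix (15.18)

eigenvalues = np.linalg.eigvalsh(M)               # sorted in ascending order

# Any direction lying purely in the multiplier block is singular,
# because the lower-right block of M is zero.
p = np.concatenate([np.zeros(n), np.ones(m)])
singular_value = p @ M @ p

print(eigenvalues, singular_value)
```

Here M has one negative eigenvalue (one per constraint, since Q is positive definite and A has full rank) and two positive ones, and p^T M p vanishes for the direction p shown.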


Second-Order  Method:  Newton’s  Method 

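Newton’s method for solving systems of equations can be easily applied to the
Lagrange equations. In its most straightforward form, the method solves the system

$$\begin{aligned}
\nabla l(\mathbf{x}, \boldsymbol{\lambda}) &= \mathbf{0} \\
\mathbf{h}(\mathbf{x}) &= \mathbf{0}
\end{aligned} \tag{15.19}$$

by solving the linearized version recursively. That is, given $\mathbf{x}_k$, $\boldsymbol{\lambda}_k$ the new point
$\mathbf{x}_{k+1}$, $\boldsymbol{\lambda}_{k+1}$ is determined from the equations

$$\begin{aligned}
\nabla l(\mathbf{x}_k, \boldsymbol{\lambda}_k)^T + \mathbf{L}(\mathbf{x}_k, \boldsymbol{\lambda}_k)\mathbf{d}_k + \nabla\mathbf{h}(\mathbf{x}_k)^T\mathbf{y}_k &= \mathbf{0} \\
\mathbf{h}(\mathbf{x}_k) + \nabla\mathbf{h}(\mathbf{x}_k)\mathbf{d}_k &= \mathbf{0}
\end{aligned} \tag{15.20}$$

by setting $\mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{d}_k$, $\boldsymbol{\lambda}_{k+1} = \boldsymbol{\lambda}_k + \mathbf{y}_k$. In matrix form the above Newton
equations are

$$\begin{bmatrix} \mathbf{L}(\mathbf{x}_k, \boldsymbol{\lambda}_k) & \nabla\mathbf{h}(\mathbf{x}_k)^T \\ \nabla\mathbf{h}(\mathbf{x}_k) & \mathbf{0} \end{bmatrix}
\begin{bmatrix} \mathbf{d}_k \\ \mathbf{y}_k \end{bmatrix} =
\begin{bmatrix} -\nabla l(\mathbf{x}_k, \boldsymbol{\lambda}_k)^T \\ -\mathbf{h}(\mathbf{x}_k) \end{bmatrix}. \tag{15.21}$$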

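The Newton equations have some important structural properties. First, we observe
that by adding $\nabla\mathbf{h}(\mathbf{x}_k)^T\boldsymbol{\lambda}_k$ to the top equation, the system can be transformed to the
form

$$\begin{bmatrix} \mathbf{L}(\mathbf{x}_k, \boldsymbol{\lambda}_k) & \nabla\mathbf{h}(\mathbf{x}_k)^T \\ \nabla\mathbf{h}(\mathbf{x}_k) & \mathbf{0} \end{bmatrix}
\begin{bmatrix} \mathbf{d}_k \\ \boldsymbol{\lambda}_{k+1} \end{bmatrix} =
\begin{bmatrix} -\nabla f(\mathbf{x}_k)^T \\ -\mathbf{h}(\mathbf{x}_k) \end{bmatrix}, \tag{15.22}$$

where again $\boldsymbol{\lambda}_{k+1} = \boldsymbol{\lambda}_k + \mathbf{y}_k$. In this form $\boldsymbol{\lambda}_k$ appears only in the matrix $\mathbf{L}(\mathbf{x}_k, \boldsymbol{\lambda}_k)$.
This conversion between (15.21) and (15.22) will be useful later.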

Next we note that the structure of the coefficient matrix of (15.21) or (15.22) is
identical to that of the Proposition of Sect. 15.1. The standard second-order sufficiency
conditions imply that $\nabla\mathbf{h}(\mathbf{x}^*)$ is of full rank and that $\mathbf{L}(\mathbf{x}^*, \boldsymbol{\lambda}^*)$ is positive
definite on $M = \{\mathbf{x} : \nabla\mathbf{h}(\mathbf{x}^*)\mathbf{x} = \mathbf{0}\}$ at the solution. By continuity these conditions
can be assumed to hold in a region near the solution as well. Under these assumptions
it follows from Proposition 1 that the Newton equation (15.21) has a unique
solution.

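It is again worthwhile to point out that, although the Hessian of the Lagrangian
need be positive definite only on the tangent subspace in order for the system (15.21)
to be nonsingular, it is possible to alter the original problem by incorporation of
a quadratic penalty term so that the new Hessian of the Lagrangian is $\mathbf{L}(\mathbf{x}, \boldsymbol{\lambda}) + c\nabla\mathbf{h}(\mathbf{x})^T\nabla\mathbf{h}(\mathbf{x})$.
For sufficiently large $c$, this new Hessian will be positive definite
over the entire space.

If $\mathbf{L}(\mathbf{x}, \boldsymbol{\lambda})$ is positive definite (either originally or through the incorporation of
a penalty term), it is possible to write an explicit expression for the solution of the
system (15.21). Let us define $\mathbf{L}_k = \mathbf{L}(\mathbf{x}_k, \boldsymbol{\lambda}_k)$, $\mathbf{A}_k = \nabla\mathbf{h}(\mathbf{x}_k)$, $\mathbf{l}_k = \nabla l(\mathbf{x}_k, \boldsymbol{\lambda}_k)^T$,
$\mathbf{h}_k = \mathbf{h}(\mathbf{x}_k)$. The system then takes the form

$$\begin{aligned}
\mathbf{L}_k\mathbf{d}_k + \mathbf{A}_k^T\mathbf{y}_k &= -\mathbf{l}_k \\
\mathbf{A}_k\mathbf{d}_k &= -\mathbf{h}_k.
\end{aligned} \tag{15.23}$$

The solution is readily found, as in (15.7) and (15.8) for quadratic programming,
to be

$$\mathbf{y}_k = (\mathbf{A}_k\mathbf{L}_k^{-1}\mathbf{A}_k^T)^{-1}[\mathbf{h}_k - \mathbf{A}_k\mathbf{L}_k^{-1}\mathbf{l}_k] \tag{15.24}$$

$$\mathbf{d}_k = -\mathbf{L}_k^{-1}[\mathbf{I} - \mathbf{A}_k^T(\mathbf{A}_k\mathbf{L}_k^{-1}\mathbf{A}_k^T)^{-1}\mathbf{A}_k\mathbf{L}_k^{-1}]\mathbf{l}_k - \mathbf{L}_k^{-1}\mathbf{A}_k^T(\mathbf{A}_k\mathbf{L}_k^{-1}\mathbf{A}_k^T)^{-1}\mathbf{h}_k. \tag{15.25}$$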


There are standard results concerning Newton’s method applied to a system of nonlinear
equations that are applicable to the system (15.19). These results state that if
the linearized system is nonsingular at the solution (as is implied by our assumptions)
and if the initial point is sufficiently close to the solution, the method will in
fact converge to the solution and the convergence will be of order at least two. To
guarantee convergence from remote initial points and hence be more broadly applicable,
it is desirable to use the method as a descent process. Fortunately, we can
show that the direction generated by Newton’s method is a descent direction for the
simple merit function

$$m(\mathbf{x}, \boldsymbol{\lambda}) = \tfrac{1}{2}|\nabla l(\mathbf{x}, \boldsymbol{\lambda})|^2 + \tfrac{1}{2}|\mathbf{h}(\mathbf{x})|^2.$$

Given $\mathbf{d}_k$, $\mathbf{y}_k$ satisfying (15.23), the inner product of this direction with the gradient
of $m$ at $\mathbf{x}_k$, $\boldsymbol{\lambda}_k$ is, referring to (15.15),

$$\begin{aligned}
[\mathbf{L}_k\mathbf{l}_k + \mathbf{A}_k^T\mathbf{h}_k, \ \mathbf{A}_k\mathbf{l}_k]^T[\mathbf{d}_k, \ \mathbf{y}_k] &= \mathbf{l}_k^T\mathbf{L}_k\mathbf{d}_k + \mathbf{h}_k^T\mathbf{A}_k\mathbf{d}_k + \mathbf{l}_k^T\mathbf{A}_k^T\mathbf{y}_k \\
&= -|\mathbf{l}_k|^2 - |\mathbf{h}_k|^2.
\end{aligned}$$

This is strictly negative unless both $\mathbf{l}_k = \mathbf{0}$ and $\mathbf{h}_k = \mathbf{0}$. Thus Newton’s method has
desirable global convergence properties when executed as a descent method with
variable step size.

Note that the calculation above does not employ the explicit formulae (15.24)
and (15.25), and hence it is not necessary that $\mathbf{L}(\mathbf{x}, \boldsymbol{\lambda})$ be positive definite, as long
as the system (15.21) is invertible. We summarize the above discussion by the following
theorem.

Theorem. Define the Newton process by

$$\begin{aligned}
\mathbf{x}_{k+1} &= \mathbf{x}_k + \alpha_k\mathbf{d}_k \\
\boldsymbol{\lambda}_{k+1} &= \boldsymbol{\lambda}_k + \alpha_k\mathbf{y}_k,
\end{aligned}$$

where $\mathbf{d}_k$, $\mathbf{y}_k$ are solutions to (15.23) and where $\alpha_k$ is selected to minimize the merit function

$$m(\mathbf{x}, \boldsymbol{\lambda}) = \tfrac{1}{2}|\nabla l(\mathbf{x}, \boldsymbol{\lambda})|^2 + \tfrac{1}{2}|\mathbf{h}(\mathbf{x})|^2.$$

Assume that $\mathbf{d}_k$, $\mathbf{y}_k$ exist and that the points generated lie in a compact set. Then any limit
point of these points satisfies the first-order necessary conditions for a solution to the
constrained minimization problem (15.1).

Proof. Most of this follows from the above observations and the Global Convergence
Theorem. The one-dimensional search process is well-defined, since the merit
function $m$ is bounded below. ∎
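The Newton process of the theorem can be sketched as follows. Two assumptions are made for illustration: the test problem (minimize x1 + x2 subject to x1^2 + x2^2 = 2) is hypothetical, and the exact minimization over the step size is replaced by a simple backtracking search on the merit function, which uses the fact that the slope of m along the Newton direction is -|l|^2 - |h|^2 = -2m.

```python
import numpy as np

# Hypothetical example: minimize x1 + x2 subject to x1^2 + x2^2 - 2 = 0.
def grad_f(x):
    return np.array([1.0, 1.0])

def h(x):
    return np.array([x[0] ** 2 + x[1] ** 2 - 2.0])

def jac_h(x):
    return np.array([[2.0 * x[0], 2.0 * x[1]]])

def hess_lagrangian(x, lam):
    return 2.0 * lam[0] * np.eye(2)          # F = 0 and H = 2I for this problem

def residuals(x, lam):
    return grad_f(x) + jac_h(x).T @ lam, h(x)

def merit(x, lam):
    l, hv = residuals(x, lam)
    return 0.5 * (l @ l) + 0.5 * (hv @ hv)

def newton_merit(x, lam, tol=1e-20, max_iter=50):
    n, m = x.size, lam.size
    for _ in range(max_iter):
        m_val = merit(x, lam)
        if m_val < tol:
            break
        l, hv = residuals(x, lam)
        A = jac_h(x)
        K = np.block([[hess_lagrangian(x, lam), A.T], [A, np.zeros((m, m))]])
        step = np.linalg.solve(K, -np.concatenate([l, hv]))   # system (15.21)
        d, y = step[:n], step[n:]
        slope = -2.0 * m_val                 # equals -|l|^2 - |h|^2
        alpha = 1.0
        # Backtracking stands in for the exact one-dimensional minimization.
        while merit(x + alpha * d, lam + alpha * y) > m_val + 1e-4 * alpha * slope and alpha > 1e-12:
            alpha *= 0.5
        x, lam = x + alpha * d, lam + alpha * y
    return x, lam

x_sol, lam_sol = newton_merit(np.array([-1.2, -0.8]), np.array([0.6]))
print(x_sol, lam_sol)
```

Started near the constrained minimum, the iteration should reach x = (-1, -1) with multiplier 1/2. Since the merit function does not distinguish minima from maxima, a start near (1, 1) could equally well be drawn to the constrained maximum.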

In view of this result, it is worth pursuing Newton’s method further. We would
like to extend it to problems with inequality constraints. We would also like to avoid
the necessity of evaluating $\mathbf{L}(\mathbf{x}_k, \boldsymbol{\lambda}_k)$ at each step and to consider alternative merit
functions, perhaps those that might distinguish a local maximum from a local minimum,
which the simple merit function does not do. These considerations guide the
developments of the next several sections.


Relation to Sequential Quadratic Programming


It  is  clear  from  the  development  of  the  preceding  discussion  that  Newton’s  method 
is  closely  related  to  quadratic  programming  with  equality  constraints.  We  explore 
this  relationship  more  fully  here,  which  will  lead  to  a  generalization  of  Newton’s 
method  to  problems  with  inequality  constraints. 

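Consider the problem

$$\begin{aligned}
\text{minimize} \quad & \mathbf{l}_k^T\mathbf{d}_k + \tfrac{1}{2}\mathbf{d}_k^T\mathbf{L}_k\mathbf{d}_k \\
\text{subject to} \quad & \mathbf{A}_k\mathbf{d}_k + \mathbf{h}_k = \mathbf{0}.
\end{aligned} \tag{15.26}$$

The first-order necessary conditions of this problem are exactly (15.21), or equivalently
(15.23), where $\mathbf{y}_k$ corresponds to the Lagrange multiplier of (15.26). Thus,
the solution of (15.26) produces a Newton step.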

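Alternatively, we may consider the quadratic program

$$\begin{aligned}
\text{minimize} \quad & \nabla f(\mathbf{x}_k)\mathbf{d}_k + \tfrac{1}{2}\mathbf{d}_k^T\mathbf{L}_k\mathbf{d}_k \\
\text{subject to} \quad & \mathbf{A}_k\mathbf{d}_k + \mathbf{h}_k = \mathbf{0}.
\end{aligned} \tag{15.27}$$

The necessary conditions of this problem are exactly (15.22), where $\boldsymbol{\lambda}_{k+1}$ now corresponds
to the Lagrange multiplier of (15.27). The program (15.27) is obtained
from (15.26) by merely subtracting $\boldsymbol{\lambda}_k^T\mathbf{A}_k\mathbf{d}_k$ from the objective function; and this
change has no influence on $\mathbf{d}_k$, since $\mathbf{A}_k\mathbf{d}_k$ is fixed.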
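The equivalence of the two quadratic programs can be confirmed numerically. The sketch below solves each through its first-order conditions; the data and the helper name solve_eq_qp are hypothetical.

```python
import numpy as np

def solve_eq_qp(H, g, A, rhs):
    """Minimize g^T d + 0.5 d^T H d subject to A d = rhs, via its first-order conditions."""
    n, m = H.shape[0], A.shape[0]
    K = np.block([[H, A.T], [A, np.zeros((m, m))]])
    sol = np.linalg.solve(K, np.concatenate([-g, rhs]))
    return sol[:n], sol[n:]                  # direction, Lagrange multiplier

# Hypothetical data at an iterate (x_k, lambda_k)
L_k = np.array([[3.0, 1.0], [1.0, 2.0]])     # Hessian of the Lagrangian
A_k = np.array([[1.0, -1.0]])                # gradient of h
grad_f = np.array([1.0, -2.0])               # gradient of f (as a column)
lam_k = np.array([0.7])
h_k = np.array([0.3])
l_k = grad_f + A_k.T @ lam_k                 # gradient of the Lagrangian

d_26, y_k = solve_eq_qp(L_k, l_k, A_k, -h_k)           # program (15.26)
d_27, lam_next = solve_eq_qp(L_k, grad_f, A_k, -h_k)   # program (15.27)
print(d_26, d_27, y_k, lam_next)
```

The two directions coincide, and the multiplier of (15.27) equals lam_k + y_k, in agreement with the conversion between (15.21) and (15.22).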

The connection with quadratic programming suggests a procedure for extending
Newton’s method to minimization problems with inequality constraints. Consider
the problem

$$\begin{aligned}
\text{minimize} \quad & f(\mathbf{x}) \\
\text{subject to} \quad & \mathbf{h}(\mathbf{x}) = \mathbf{0} \\
& \mathbf{g}(\mathbf{x}) \le \mathbf{0}.
\end{aligned}$$

Given an estimated solution point $\mathbf{x}_k$ and estimated Lagrange multipliers $\boldsymbol{\lambda}_k$, $\boldsymbol{\mu}_k$,
one solves the quadratic program

$$\begin{aligned}
\text{minimize} \quad & \nabla f(\mathbf{x}_k)\mathbf{d}_k + \tfrac{1}{2}\mathbf{d}_k^T\mathbf{L}_k\mathbf{d}_k \\
\text{subject to} \quad & \nabla\mathbf{h}(\mathbf{x}_k)\mathbf{d}_k + \mathbf{h}_k = \mathbf{0} \\
& \nabla\mathbf{g}(\mathbf{x}_k)\mathbf{d}_k + \mathbf{g}_k \le \mathbf{0},
\end{aligned} \tag{15.28}$$

where $\mathbf{L}_k = \mathbf{F}(\mathbf{x}_k) + \boldsymbol{\lambda}_k^T\mathbf{H}(\mathbf{x}_k) + \boldsymbol{\mu}_k^T\mathbf{G}(\mathbf{x}_k)$, $\mathbf{h}_k = \mathbf{h}(\mathbf{x}_k)$, $\mathbf{g}_k = \mathbf{g}(\mathbf{x}_k)$. The new point is
determined by $\mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{d}_k$, and the new Lagrange multipliers are the Lagrange
multipliers of the quadratic program (15.28). This is the essence of an early method
for nonlinear programming termed SOLVER. It is a very attractive procedure, since
it applies directly to problems with inequality as well as equality constraints without
the use of an active set strategy (although such a strategy might be used to solve
the required quadratic program). Methods of this general type, where a quadratic
program is solved at each step, are referred to as recursive quadratic programming
methods, and several variations are considered in this chapter.

As presented here the recursive quadratic programming method extends Newton’s
method to problems with inequality constraints, but the method has limitations. The
quadratic program may not always be well-defined, the method requires second-order
derivative information, and the simple merit function is not a descent function
for the case of inequalities. Of these, the most serious is the requirement of second-order
information, and this is addressed in the next section.


15.4  Modified  Newton  Methods 


A  modified  Newton  method  is  based  on  replacing  the  actual  linearized  system  by  an 
approximation. 

First, we concentrate on the equality constrained optimization problem

$$\begin{aligned}
\text{minimize} \quad & f(\mathbf{x}) \\
\text{subject to} \quad & \mathbf{h}(\mathbf{x}) = \mathbf{0}
\end{aligned} \tag{15.29}$$

in order to most clearly describe the relationships between the various approaches.
Problems with inequality constraints can be treated within the equality constraint
framework by an active set strategy or, in some cases, by recursive quadratic programming.

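The basic equations for Newton’s method can be written

$$\begin{bmatrix} \mathbf{x}_{k+1} \\ \boldsymbol{\lambda}_{k+1} \end{bmatrix} =
\begin{bmatrix} \mathbf{x}_k \\ \boldsymbol{\lambda}_k \end{bmatrix} -
\begin{bmatrix} \mathbf{L}_k & \mathbf{A}_k^T \\ \mathbf{A}_k & \mathbf{0} \end{bmatrix}^{-1}
\begin{bmatrix} \mathbf{l}_k \\ \mathbf{h}_k \end{bmatrix},$$

where as before $\mathbf{L}_k$ is the Hessian of the Lagrangian, $\mathbf{A}_k = \nabla\mathbf{h}(\mathbf{x}_k)$,
$\mathbf{l}_k = [\nabla f(\mathbf{x}_k) + \boldsymbol{\lambda}_k^T\nabla\mathbf{h}(\mathbf{x}_k)]^T$, $\mathbf{h}_k = \mathbf{h}(\mathbf{x}_k)$. A structured modified Newton method is a method of the
form

$$\begin{bmatrix} \mathbf{x}_{k+1} \\ \boldsymbol{\lambda}_{k+1} \end{bmatrix} =
\begin{bmatrix} \mathbf{x}_k \\ \boldsymbol{\lambda}_k \end{bmatrix} - \alpha_k
\begin{bmatrix} \mathbf{B}_k & \mathbf{A}_k^T \\ \mathbf{A}_k & \mathbf{0} \end{bmatrix}^{-1}
\begin{bmatrix} \mathbf{l}_k \\ \mathbf{h}_k \end{bmatrix}, \tag{15.30}$$

where $\mathbf{B}_k$ is an approximation to $\mathbf{L}_k$. The term “structured” derives from the fact that
only second-order information in the original system of equations is approximated;
the first-order information is kept intact.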

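Of course the method is implemented by solving the system

$$\begin{aligned}
\mathbf{B}_k\mathbf{d}_k + \mathbf{A}_k^T\mathbf{y}_k &= -\mathbf{l}_k \\
\mathbf{A}_k\mathbf{d}_k &= -\mathbf{h}_k
\end{aligned} \tag{15.31}$$

for $\mathbf{d}_k$ and $\mathbf{y}_k$ and then setting $\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\mathbf{d}_k$, $\boldsymbol{\lambda}_{k+1} = \boldsymbol{\lambda}_k + \alpha_k\mathbf{y}_k$ for some value
of $\alpha_k$. In this section we will not consider the procedure for selection of $\alpha_k$, and thus
for simplicity we take $\alpha_k = 1$. The simple transformation used earlier can be applied
to write (15.31) in the form

$$\begin{aligned}
\mathbf{B}_k\mathbf{d}_k + \mathbf{A}_k^T\boldsymbol{\lambda}_{k+1} &= -\nabla f(\mathbf{x}_k)^T \\
\mathbf{A}_k\mathbf{d}_k &= -\mathbf{h}_k.
\end{aligned} \tag{15.32}$$

Then $\mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{d}_k$, and $\boldsymbol{\lambda}_{k+1}$ is found directly as a solution to system (15.32).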

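There are, of course, various ways to choose the approximation $\mathbf{B}_k$. One is to use
a fixed, constant matrix throughout the iterative process. A second is to base $\mathbf{B}_k$ on
some readily accessible information in $\mathbf{L}(\mathbf{x}_k, \boldsymbol{\lambda}_k)$, such as setting $\mathbf{B}_k$ equal to the
diagonal of $\mathbf{L}(\mathbf{x}_k, \boldsymbol{\lambda}_k)$. Finally, a third possibility is to update $\mathbf{B}_k$ using one of the
various quasi-Newton formulae.

One important advantage of the structured method is that $\mathbf{B}_k$ can be taken to be
positive definite even though $\mathbf{L}_k$ is not. If this is done, we can write the explicit
solution

$$\mathbf{y}_k = (\mathbf{A}_k\mathbf{B}_k^{-1}\mathbf{A}_k^T)^{-1}[\mathbf{h}_k - \mathbf{A}_k\mathbf{B}_k^{-1}\mathbf{l}_k] \tag{15.33}$$

$$\mathbf{d}_k = -\mathbf{B}_k^{-1}[\mathbf{I} - \mathbf{A}_k^T(\mathbf{A}_k\mathbf{B}_k^{-1}\mathbf{A}_k^T)^{-1}\mathbf{A}_k\mathbf{B}_k^{-1}]\mathbf{l}_k - \mathbf{B}_k^{-1}\mathbf{A}_k^T(\mathbf{A}_k\mathbf{B}_k^{-1}\mathbf{A}_k^T)^{-1}\mathbf{h}_k. \tag{15.34}$$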

Consider the quadratic program

$$\begin{aligned}
\text{minimize} \quad & \nabla f(\mathbf{x}_k)\mathbf{d}_k + \tfrac{1}{2}\mathbf{d}_k^T\mathbf{B}_k\mathbf{d}_k \\
\text{subject to} \quad & \mathbf{A}_k\mathbf{d}_k + \mathbf{h}(\mathbf{x}_k) = \mathbf{0}.
\end{aligned} \tag{15.35}$$

The first-order necessary conditions for this problem are

$$\begin{aligned}
\mathbf{B}_k\mathbf{d}_k + \mathbf{A}_k^T\boldsymbol{\lambda}_{k+1} &= -\nabla f(\mathbf{x}_k)^T \\
\mathbf{A}_k\mathbf{d}_k &= -\mathbf{h}(\mathbf{x}_k),
\end{aligned} \tag{15.36}$$

which are again identical to the system of equations of the structured modified
Newton method, in this case in the form (15.32). The Lagrange multiplier of the
quadratic program is $\boldsymbol{\lambda}_{k+1}$. The equivalence of (15.35) and (15.36) leads to a recursive
quadratic programming method, where at each $\mathbf{x}_k$ the quadratic program (15.35)
is solved to determine the direction $\mathbf{d}_k$. In this case an arbitrary symmetric matrix
$\mathbf{B}_k$ is used in place of the Hessian of the Lagrangian. Note that the problem (15.35)
does not explicitly depend on $\boldsymbol{\lambda}_k$, but $\mathbf{B}_k$, often being chosen to approximate the
Hessian of the Lagrangian, may depend on $\boldsymbol{\lambda}_k$.

As before, a principal advantage of the quadratic programming formulation is
that there is an obvious extension to problems with inequality constraints: One
simply employs a linearized version of the inequalities.
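A structured modified Newton iteration with a fixed matrix B (the first of the choices listed above) can be sketched as follows. The test problem, the choice B = 4I, and the function names are assumptions made for illustration; the step size is taken as 1, as in the text.

```python
import numpy as np

# Hypothetical problem: minimize x1^2 + 2*x2^2 subject to x1 + x2 - 1 = 0.
Q = np.diag([2.0, 4.0])
A = np.array([[1.0, 1.0]])

def grad_f(x):
    return Q @ x

def h(x):
    return A @ x - 1.0

def modified_newton(x, B, iters=60):
    """Iterate system (15.32) with a fixed approximation B and unit step size."""
    n, m = x.size, A.shape[0]
    K = np.block([[B, A.T], [A, np.zeros((m, m))]])
    lam = np.zeros(m)
    for _ in range(iters):
        sol = np.linalg.solve(K, -np.concatenate([grad_f(x), h(x)]))
        d, lam = sol[:n], sol[n:]            # lam is the new multiplier estimate
        x = x + d
    return x, lam

x_sol, lam_sol = modified_newton(np.zeros(2), 4.0 * np.eye(2))
print(x_sol, lam_sol)
```

With B = 4I the constraint is satisfied after the first step and the remaining error shrinks by a factor of about 1/4 per iteration on this problem, whereas B = Q would reproduce Newton's method and finish in one step. A poorly scaled fixed B can converge much more slowly or diverge.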


15.5  Descent  Properties 

In order to ensure convergence of the structured modified Newton methods of the
previous section, it is necessary to find a suitable merit function, that is, a merit function
that is compatible with the direction-finding algorithm in the sense that it decreases
along the direction generated. We must abandon the simple merit function at this
point, since it is not compatible with these methods when $\mathbf{B}_k \ne \mathbf{L}_k$. However, two
other penalty functions considered earlier, the absolute-value exact penalty function
and the quadratic penalty function, are compatible with the modified Newton
approach.


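Absolute-Value Penalty Function

Let us consider the constrained minimization problem

$$\begin{aligned}
\text{minimize} \quad & f(\mathbf{x}) \\
\text{subject to} \quad & \mathbf{g}(\mathbf{x}) \le \mathbf{0},
\end{aligned} \tag{15.37}$$

where $\mathbf{g}(\mathbf{x})$ is $r$-dimensional. For notational simplicity we consider the case of inequality
constraints only, since it is, in fact, the most difficult case. The extension
to equality constraints is straightforward. In accordance with the recursive quadratic
programming approach, given a current point $\mathbf{x}$, we select the direction of movement
$\mathbf{d}$ by solving the quadratic programming problem

$$\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2}\mathbf{d}^T\mathbf{B}\mathbf{d} + \nabla f(\mathbf{x})\mathbf{d} \\
\text{subject to} \quad & \nabla\mathbf{g}(\mathbf{x})\mathbf{d} + \mathbf{g}(\mathbf{x}) \le \mathbf{0},
\end{aligned} \tag{15.38}$$

where $\mathbf{B}$ is positive definite.

The first-order necessary conditions for a solution to this quadratic program are

$$\mathbf{B}\mathbf{d} + \nabla f(\mathbf{x})^T + \nabla\mathbf{g}(\mathbf{x})^T\boldsymbol{\mu} = \mathbf{0} \tag{15.39a}$$

$$\nabla\mathbf{g}(\mathbf{x})\mathbf{d} + \mathbf{g}(\mathbf{x}) \le \mathbf{0} \tag{15.39b}$$

$$\boldsymbol{\mu}^T[\nabla\mathbf{g}(\mathbf{x})\mathbf{d} + \mathbf{g}(\mathbf{x})] = 0 \tag{15.39c}$$

$$\boldsymbol{\mu} \ge \mathbf{0}. \tag{15.39d}$$

Note that if the solution to the quadratic program has $\mathbf{d} = \mathbf{0}$, then the point $\mathbf{x}$,
together with $\boldsymbol{\mu}$ from (15.39), satisfies the first-order necessary conditions for the
original minimization problem (15.37). The following proposition is the fundamental
result concerning the compatibility of the absolute-value penalty function and
the quadratic programming method for determining the direction of movement.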

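Proposition 1. Let $\mathbf{d}$, $\boldsymbol{\mu}$ (with $\mathbf{d} \ne \mathbf{0}$) be a solution of the quadratic program (15.38). Then
if $c \ge \max_j(\mu_j)$, the vector $\mathbf{d}$ is a descent direction for the penalty function

$$P(\mathbf{x}) = f(\mathbf{x}) + c\sum_{j=1}^{r} g_j(\mathbf{x})^+.$$

Proof. Let $J(\mathbf{x}) = \{j : g_j(\mathbf{x}) > 0\}$. Now for $\alpha > 0$,

$$\begin{aligned}
P(\mathbf{x} + \alpha\mathbf{d}) &= f(\mathbf{x} + \alpha\mathbf{d}) + c\sum_{j=1}^{r} g_j(\mathbf{x} + \alpha\mathbf{d})^+ \\
&= f(\mathbf{x}) + \alpha\nabla f(\mathbf{x})\mathbf{d} + c\sum_{j=1}^{r}[g_j(\mathbf{x}) + \alpha\nabla g_j(\mathbf{x})\mathbf{d}]^+ + o(\alpha) \\
&= f(\mathbf{x}) + \alpha\nabla f(\mathbf{x})\mathbf{d} + c\sum_{j=1}^{r} g_j(\mathbf{x})^+ + \alpha c\sum_{j\in J(\mathbf{x})}\nabla g_j(\mathbf{x})\mathbf{d} + o(\alpha) \\
&= P(\mathbf{x}) + \alpha\nabla f(\mathbf{x})\mathbf{d} + \alpha c\sum_{j\in J(\mathbf{x})}\nabla g_j(\mathbf{x})\mathbf{d} + o(\alpha),
\end{aligned} \tag{15.40}$$

where (15.39b) was used in the third line to infer that $\nabla g_j(\mathbf{x})\mathbf{d} \le 0$ if $g_j(\mathbf{x}) = 0$.
Again using (15.39b) we have

$$c\sum_{j\in J(\mathbf{x})}\nabla g_j(\mathbf{x})\mathbf{d} \le c\sum_{j\in J(\mathbf{x})}(-g_j(\mathbf{x})) = -c\sum_{j=1}^{r} g_j(\mathbf{x})^+. \tag{15.41}$$

Using (15.39a) we have

$$\nabla f(\mathbf{x})\mathbf{d} = -\mathbf{d}^T\mathbf{B}\mathbf{d} - \sum_{j=1}^{r}\mu_j\nabla g_j(\mathbf{x})\mathbf{d},$$

which by using the complementary slackness condition (15.39c) leads to

$$\begin{aligned}
\nabla f(\mathbf{x})\mathbf{d} = -\mathbf{d}^T\mathbf{B}\mathbf{d} + \sum_{j=1}^{r}\mu_j g_j(\mathbf{x}) &\le -\mathbf{d}^T\mathbf{B}\mathbf{d} + \sum_{j=1}^{r}\mu_j g_j(\mathbf{x})^+ \\
&\le -\mathbf{d}^T\mathbf{B}\mathbf{d} + \max_j(\mu_j)\sum_{j=1}^{r} g_j(\mathbf{x})^+.
\end{aligned} \tag{15.42}$$

Finally, substituting (15.41) and (15.42) in (15.40), we find

$$P(\mathbf{x} + \alpha\mathbf{d}) \le P(\mathbf{x}) + \alpha\Big\{-\mathbf{d}^T\mathbf{B}\mathbf{d} - [c - \max_j(\mu_j)]\sum_{j=1}^{r} g_j(\mathbf{x})^+\Big\} + o(\alpha).$$

Since $\mathbf{B}$ is positive definite and $c \ge \max_j(\mu_j)$, it follows that for $\alpha$ sufficiently small,
$P(\mathbf{x} + \alpha\mathbf{d}) < P(\mathbf{x})$. ∎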
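Proposition 1 can be illustrated on a problem with a single inequality constraint, for which the quadratic program (15.38) can be solved by checking whether the linearized constraint is active. This is a minimal sketch: the problem, the matrix B = 2I, the penalty constant c = 1, and all function names are hypothetical.

```python
import numpy as np

# Hypothetical problem: minimize (x1-2)^2 + (x2-1)^2 subject to x1^2 + x2^2 - 1 <= 0.
def f(x):
    return (x[0] - 2.0) ** 2 + (x[1] - 1.0) ** 2

def grad_f(x):
    return np.array([2.0 * (x[0] - 2.0), 2.0 * (x[1] - 1.0)])

def g(x):
    return x[0] ** 2 + x[1] ** 2 - 1.0

def grad_g(x):
    return np.array([2.0 * x[0], 2.0 * x[1]])

def qp_direction(x, B):
    """Solve (15.38) for one constraint: try mu = 0, otherwise make the constraint active."""
    gf, a, gx = grad_f(x), grad_g(x), g(x)
    d = -np.linalg.solve(B, gf)
    if a @ d + gx <= 0.0:
        return d, 0.0
    mu = (a @ d + gx) / (a @ np.linalg.solve(B, a))
    d = -np.linalg.solve(B, gf + mu * a)
    return d, mu

def penalty(x, c):
    """Absolute-value penalty function P(x) = f(x) + c * g(x)^+."""
    return f(x) + c * max(g(x), 0.0)

x = np.array([1.5, 0.5])         # an infeasible point, g(x) = 1.5
B = 2.0 * np.eye(2)
d, mu = qp_direction(x, B)
c = 1.0                          # satisfies c >= mu for this data

for alpha in (1e-3, 1e-2, 1e-1):
    print(alpha, penalty(x + alpha * d, c) - penalty(x, c))
```

For this data the multiplier is 0.7 and the printed differences are negative, as the proposition predicts for sufficiently small step sizes.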

The above proposition is exceedingly important, for it provides a basis for establishing
the global convergence of modified Newton methods, including recursive
quadratic programming. The following is a simple global convergence result based
on the descent property.

Theorem. Let $\mathbf{B}$ be positive definite and assume that throughout some compact region
$\Omega \subset E^n$, the quadratic program (15.38) has a unique solution $\mathbf{d}$, $\boldsymbol{\mu}$ such that at each point
the Lagrange multipliers satisfy $\max_j(\mu_j) \le c$. Let the sequence $\{\mathbf{x}_k\}$ be generated by

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\mathbf{d}_k,$$

where $\mathbf{d}_k$ is the solution to (15.38) at $\mathbf{x}_k$ and where $\alpha_k$ minimizes $P(\mathbf{x}_{k+1})$. Assume that each
$\mathbf{x}_k \in \Omega$. Then every limit point $\bar{\mathbf{x}}$ of $\{\mathbf{x}_k\}$ satisfies the first-order necessary conditions for the
constrained minimization problem (15.37).

Proof. The solution to a quadratic program depends continuously on the data, and
hence the direction determined by the quadratic program (15.38) is a continuous
function of $\mathbf{x}$. The function $P(\mathbf{x})$ is also continuous, and by Proposition 1, it follows
that $P$ is a descent function at every point that does not satisfy the first-order
conditions. The result thus follows from the Global Convergence Theorem. ∎

In view of the above result, recursive quadratic programming in conjunction with
the absolute-value penalty function is an attractive technique. There are, however,
some difficulties to be kept in mind. First, the selection of the parameter $\alpha_k$ requires
a one-dimensional search with respect to a nondifferentiable function. Thus the efficient
curve-fitting search methods of Chap. 8 cannot be used without significant
modification. Second, use of the absolute-value function requires an estimate of an
upper bound for the $\mu_j$’s, so that $c$ can be selected properly. In some applications a
suitable bound can be obtained from previous experience, but in general one must
develop a method for revising the estimate upward when necessary.

Another potential difficulty with the quadratic programming approach above is
that the quadratic program (15.38) may be infeasible at some point $\mathbf{x}_k$, even though
the original problem (15.37) is feasible. If this happens, the method breaks down.
However, see Exercise 8 for a method that avoids this problem.


The  Quadratic  Penalty  Function 

Another penalty function that is compatible with the modified Newton method
approach is the standard quadratic penalty function. It has the added technical advantage
that, since this penalty function is differentiable, it is possible to apply our
earlier analytical principles to study the rate of convergence of the method. This
leads to an analytical comparison of primal-dual methods with the methods of other
chapters.

We shall restrict attention to the problem with equality constraints, since that is
all that is required for a rate of convergence analysis. The method can be extended
to problems with inequality constraints either directly or by an active set method.
Thus we consider the problem


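$$\begin{aligned}
\text{minimize} \quad & f(\mathbf{x}) \\
\text{subject to} \quad & \mathbf{h}(\mathbf{x}) = \mathbf{0}
\end{aligned} \tag{15.43}$$

and the standard quadratic penalty objective

$$P(\mathbf{x}) = f(\mathbf{x}) + \tfrac{1}{2}c|\mathbf{h}(\mathbf{x})|^2. \tag{15.44}$$

From the theory in Chap. 13, we know that minimization of the objective with
a quadratic penalty function will not yield an exact solution to (15.43). In fact,
the minimum of the penalty function (15.44) will have $c\mathbf{h}(\mathbf{x}) \approx \boldsymbol{\lambda}$, where $\boldsymbol{\lambda}$ is the
Lagrange multiplier of (15.43). Therefore, it seems appropriate in this case to consider
the quadratic programming problem

$$\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2}\mathbf{d}^T\mathbf{B}\mathbf{d} + \nabla f(\mathbf{x})\mathbf{d} \\
\text{subject to} \quad & \nabla\mathbf{h}(\mathbf{x})\mathbf{d} + \mathbf{h}(\mathbf{x}) = \hat{\boldsymbol{\lambda}}/c,
\end{aligned} \tag{15.45}$$

where $\hat{\boldsymbol{\lambda}}$ is an estimate of the Lagrange multiplier of the original problem. A particularly
good choice is

$$\hat{\boldsymbol{\lambda}} = [(1/c)\mathbf{I} + \mathbf{Q}]^{-1}[\mathbf{h}(\mathbf{x}) - \mathbf{A}\mathbf{B}^{-1}\nabla f(\mathbf{x})^T], \tag{15.46}$$

where $\mathbf{A} = \nabla\mathbf{h}(\mathbf{x})$, $\mathbf{Q} = \mathbf{A}\mathbf{B}^{-1}\mathbf{A}^T$, which is the Lagrange multiplier that would be
obtained by the quadratic program with the penalty method. The proposed method
requires that $\hat{\boldsymbol{\lambda}}$ be first estimated from (15.46) and then used in the quadratic programming
problem (15.45).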

The  following  proposition  shows  that  this  procedure  produces  a  descent  direction 
for  the  quadratic  penalty  objective. 

Proposition  2.  For  any  c  >  0,  let  d,  A  (with  d  4  0 )  be  a  solution  to  the  quadratic  pro¬ 
gram  (15.45).  Then  d  is  a  descent  direction  of  the  function  P(x)  =  /(x)  +  (l/2)c|h(x)|2. 

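Proof. We have from the constraint equation

$$\mathbf{A}\mathbf{d} = (1/c)\hat{\boldsymbol{\lambda}} - \mathbf{h}(\mathbf{x}),$$

which yields

$$c\mathbf{A}^T\mathbf{A}\mathbf{d} = \mathbf{A}^T\hat{\boldsymbol{\lambda}} - c\mathbf{A}^T\mathbf{h}(\mathbf{x}).$$

Solving the necessary conditions for (15.45) yields (see the top part of (15.9) for a similar expression with $\mathbf{Q} = \mathbf{B}$ there)

$$\mathbf{B}\mathbf{d} = \mathbf{A}^T\mathbf{Q}^{-1}[\mathbf{A}\mathbf{B}^{-1}\nabla f(\mathbf{x})^T + (1/c)\hat{\boldsymbol{\lambda}} - \mathbf{h}(\mathbf{x})] - \nabla f(\mathbf{x})^T.$$

Therefore,

$$\begin{aligned} (\mathbf{B} + c\mathbf{A}^T\mathbf{A})\mathbf{d} &= \mathbf{A}^T\mathbf{Q}^{-1}[\mathbf{A}\mathbf{B}^{-1}\nabla f(\mathbf{x})^T - \mathbf{h}(\mathbf{x})] + \mathbf{A}^T[(1/c)\mathbf{Q}^{-1} + \mathbf{I}]\hat{\boldsymbol{\lambda}} - \nabla f(\mathbf{x})^T - c\mathbf{A}^T\mathbf{h}(\mathbf{x}) \\ &= \mathbf{A}^T\mathbf{Q}^{-1}\{\mathbf{A}\mathbf{B}^{-1}\nabla f(\mathbf{x})^T - \mathbf{h}(\mathbf{x}) + ((1/c)\mathbf{I} + \mathbf{Q})\hat{\boldsymbol{\lambda}}\} - \nabla f(\mathbf{x})^T - c\mathbf{A}^T\mathbf{h}(\mathbf{x}) \\ &= -\nabla f(\mathbf{x})^T - c\mathbf{A}^T\mathbf{h}(\mathbf{x}) = -\nabla P(\mathbf{x})^T, \end{aligned}$$

where the term in braces vanishes by the definition (15.46) of $\hat{\boldsymbol{\lambda}}$. The matrix $(\mathbf{B} + c\mathbf{A}^T\mathbf{A})$ is positive definite for any $c > 0$. It follows that $\nabla P(\mathbf{x})\mathbf{d} < 0$. ∎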
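As a numerical sanity check of Proposition 2, the following minimal sketch evaluates the estimate (15.46), solves the quadratic program (15.45), and confirms the identity $(\mathbf{B} + c\mathbf{A}^T\mathbf{A})\mathbf{d} = -\nabla P(\mathbf{x})^T$ on one small instance. Everything about the instance is an assumption made for illustration: a single equality constraint, a diagonal $\mathbf{B}$, and the hypothetical data $f(\mathbf{x}) = x_1^2 + 2x_2^2$, $h(\mathbf{x}) = x_1 + x_2 - 1$ evaluated at $\mathbf{x} = (1, 1)$. The function name `qp_direction` is likewise illustrative.

```python
def qp_direction(grad_f, h, a, b_diag, c):
    """Solve (15.45) for one equality constraint and a diagonal B.

    grad_f: gradient of f at x; h: value of the constraint h(x);
    a: gradient of h at x (the single row of A); b_diag: diagonal of B.
    Returns the direction d and the multiplier estimate of (15.46).
    """
    n = len(grad_f)
    binv_g = [grad_f[i] / b_diag[i] for i in range(n)]   # B^{-1} grad f^T
    a_binv_g = sum(a[i] * binv_g[i] for i in range(n))   # A B^{-1} grad f^T
    q = sum(a[i] * a[i] / b_diag[i] for i in range(n))   # Q = A B^{-1} A^T
    lam_hat = (h - a_binv_g) / (1.0 / c + q)             # estimate (15.46)
    rhs = lam_hat / c - h                                # constraint: A d = rhs
    # Stationarity B d + grad f^T + a * mu = 0 combined with A d = rhs gives mu.
    mu = -(a_binv_g + rhs) / q
    d = [-(grad_f[i] + a[i] * mu) / b_diag[i] for i in range(n)]
    return d, lam_hat


# Hypothetical data: f(x) = x1^2 + 2*x2^2, h(x) = x1 + x2 - 1, at x = (1, 1).
grad_f = [2.0, 4.0]
h = 1.0
a = [1.0, 1.0]
b_diag = [1.0, 1.0]   # B = I
c = 10.0

d, lam_hat = qp_direction(grad_f, h, a, b_diag, c)
a_dot_d = sum(ai * di for ai, di in zip(a, d))
lhs = [b_diag[i] * d[i] + c * a[i] * a_dot_d for i in range(2)]  # (B + c A^T A) d
grad_P = [grad_f[i] + c * h * a[i] for i in range(2)]            # gradient of P
slope = sum(g * di for g, di in zip(grad_P, d))                  # grad P(x) d
print(lhs, grad_P, slope)
```

In this instance `lhs` is the negative of `grad_P` up to rounding and `slope` is negative, which is what the proposition asserts.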
*15.6 Rate of Convergence

It  is  now  appropriate  to  apply  the  principles  of  convergence  analysis  that  have  been 
repeatedly  emphasized  in  previous  chapters  to  the  recursive  quadratic  programming 
approach.  We  expect  that,  if  this  new  approach  is  well  founded,  then  the  rate  of 
convergence  of  the  algorithm  should  be  related  to  the  familiar  canonical  rate,  which 
we  have  learned  is  a  fundamental  measure  of  the  complexity  of  the  problem.  If  it  is 
not  so  related,  then  some  modification  of  the  algorithm  is  probably  required.  Indeed, 
we  shall  find  that  a  small  but  important  modification  is  required. 

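From the proof of Proposition 2 of Sect. 15.5, we have the formula

$$(\mathbf{B} + c\mathbf{A}^T\mathbf{A})\mathbf{d} = -\nabla P(\mathbf{x})^T,$$

which can be written as

$$\mathbf{d} = -(\mathbf{B} + c\mathbf{A}^T\mathbf{A})^{-1}\nabla P(\mathbf{x})^T.$$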

This shows that the method is a modified Newton method applied to the unconstrained minimization of $P(\mathbf{x})$. From the Modified Newton Method Theorem of Sect. 10.1, we see immediately that the rate of convergence is determined by the eigenvalues of the matrix that is the product of the coefficient matrix $(\mathbf{B} + c\mathbf{A}^T\mathbf{A})^{-1}$ and the Hessian of the function $P$ at the solution point. The Hessian of $P$ is $(\mathbf{L} + c\mathbf{A}^T\mathbf{A})$, where $\mathbf{L} = \mathbf{F}(\mathbf{x}) + c\mathbf{h}(\mathbf{x})^T\mathbf{H}(\mathbf{x})$. We know that the vector $c\mathbf{h}(\mathbf{x})$ at the solution of the penalty problem is equal to $\boldsymbol{\lambda}_c$, where $\nabla f(\mathbf{x}) + \boldsymbol{\lambda}_c^T\nabla\mathbf{h}(\mathbf{x}) = \mathbf{0}$. Therefore, the rate of convergence is determined by the eigenvalues of

$$(\mathbf{B} + c\mathbf{A}^T\mathbf{A})^{-1}(\mathbf{L} + c\mathbf{A}^T\mathbf{A}), \qquad (15.47)$$

where all quantities are evaluated at the solution to the penalty problem and $\mathbf{L} = \mathbf{F} + \boldsymbol{\lambda}_c^T\mathbf{H}$. For large values of $c$, all quantities are approximately equal to the values at the optimal solution to the constrained problem.

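Now what we wish to show is that as $c \to \infty$, the matrix (15.47) looks like $\mathbf{B}_M^{-1}\mathbf{L}_M$ on the subspace $M$, and like the identity matrix on $M^{\perp}$, the subspace orthogonal to $M$. To do this in detail, let $\mathbf{C}$ be an $n \times (n - m)$ matrix whose columns form an orthonormal basis for $M$, the tangent subspace $\{\mathbf{x} : \mathbf{A}\mathbf{x} = \mathbf{0}\}$. Let $\mathbf{D} = \mathbf{A}^T(\mathbf{A}\mathbf{A}^T)^{-1}$. Then $\mathbf{A}\mathbf{C} = \mathbf{0}$, $\mathbf{A}\mathbf{D} = \mathbf{I}$, $\mathbf{C}^T\mathbf{C} = \mathbf{I}$, $\mathbf{C}^T\mathbf{D} = \mathbf{0}$.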

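The eigenvalues of $(\mathbf{B} + c\mathbf{A}^T\mathbf{A})^{-1}(\mathbf{L} + c\mathbf{A}^T\mathbf{A})$ are equal to those of

$$[\mathbf{C}, \mathbf{D}]^{-1}(\mathbf{B} + c\mathbf{A}^T\mathbf{A})^{-1}\{[\mathbf{C}, \mathbf{D}]^T\}^{-1}[\mathbf{C}, \mathbf{D}]^T(\mathbf{L} + c\mathbf{A}^T\mathbf{A})[\mathbf{C}, \mathbf{D}] = \begin{bmatrix} \mathbf{C}^T\mathbf{B}\mathbf{C} & \mathbf{C}^T\mathbf{B}\mathbf{D} \\ \mathbf{D}^T\mathbf{B}\mathbf{C} & \mathbf{D}^T\mathbf{B}\mathbf{D} + c\mathbf{I} \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{C}^T\mathbf{L}\mathbf{C} & \mathbf{C}^T\mathbf{L}\mathbf{D} \\ \mathbf{D}^T\mathbf{L}\mathbf{C} & \mathbf{D}^T\mathbf{L}\mathbf{D} + c\mathbf{I} \end{bmatrix}.$$

Now as $c \to \infty$, the matrix above approaches

$$\begin{bmatrix} \mathbf{B}_M^{-1}\mathbf{L}_M & \mathbf{B}_M^{-1}\mathbf{C}^T(\mathbf{L} - \mathbf{B})\mathbf{D} \\ \mathbf{0} & \mathbf{I} \end{bmatrix},$$

where $\mathbf{B}_M = \mathbf{C}^T\mathbf{B}\mathbf{C}$, $\mathbf{L}_M = \mathbf{C}^T\mathbf{L}\mathbf{C}$ (see Exercise 6). The eigenvalues of this matrix are those of $\mathbf{B}_M^{-1}\mathbf{L}_M$ together with those of $\mathbf{I}$. This analysis leads directly to the following conclusion: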

Theorem. Let $a$, $A$ be the smallest and largest eigenvalues, respectively, of $\mathbf{B}_M^{-1}\mathbf{L}_M$ and assume that $a \leq 1 \leq A$. Then the structured modified Newton method with quadratic penalty function has a rate of convergence no greater than $[(A - a)/(A + a)]^2$ as $c \to \infty$.

In the special case of $\mathbf{B} = \mathbf{I}$, the rate in the above theorem is precisely the canonical rate, defined by the eigenvalues of $\mathbf{L}$ restricted to the tangent plane. It is important to note, however, that in order for the rate of the theorem to be achieved, the eigenvalues of $\mathbf{B}_M^{-1}\mathbf{L}_M$ must be spread around unity; if not, the rate will be poorer. Thus, even if $\mathbf{L}_M$ is well-conditioned, but the eigenvalues differ greatly from unity, the choice $\mathbf{B} = \mathbf{I}$ may be poor. This is an instance where proper scaling is vital. (We also point out that the above analysis is closely related to that of Sect. 13.4, where a similar conclusion is obtained.)
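The limiting behavior of (15.47) can be observed numerically. The sketch below is only an illustration under assumed data: $n = 2$, a single constraint row $\mathbf{A} = [0, 1]$ (so the tangent subspace is spanned by $(1, 0)$), $\mathbf{B} = \mathbf{I}$, and a hypothetical matrix $\mathbf{L}$ with $\mathbf{L}_M = 4$. The helper names are not from the text. As $c$ grows, one eigenvalue should approach $\mathbf{B}_M^{-1}\mathbf{L}_M = 4$ and the other should approach 1.

```python
import math


def matmul2(p, q):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def inv2(m):
    """Inverse of a nonsingular 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def eig2(m):
    """Eigenvalues (smallest first) of a 2x2 matrix with real eigenvalues."""
    tr = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    root = math.sqrt(tr * tr - 4.0 * det)
    return (tr - root) / 2.0, (tr + root) / 2.0


def rate_matrix(B, L, a, c):
    """Form (B + c A^T A)^{-1} (L + c A^T A) for a single constraint row a."""
    ata = [[a[i] * a[j] for j in range(2)] for i in range(2)]
    bc = [[B[i][j] + c * ata[i][j] for j in range(2)] for i in range(2)]
    lc = [[L[i][j] + c * ata[i][j] for j in range(2)] for i in range(2)]
    return matmul2(inv2(bc), lc)


B = [[1.0, 0.0], [0.0, 1.0]]
L = [[4.0, 1.0], [1.0, 3.0]]   # hypothetical Hessian of the Lagrangian
a = [0.0, 1.0]                 # tangent subspace M is spanned by (1, 0)

for c in (1.0, 100.0, 1.0e6):
    print(c, eig2(rate_matrix(B, L, a, c)))

small, large = eig2(rate_matrix(B, L, a, 1.0e6))
```

Here the eigenvalue 4 is far from unity even though $\mathbf{L}_M$ is a scalar and hence perfectly conditioned, which is the scaling issue described above.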

There is a geometric explanation for the scaling property. Take $\mathbf{B} = \mathbf{I}$ for simplicity. Then the direction of movement $\mathbf{d}$ is $\mathbf{d} = -\nabla f(\mathbf{x})^T + \mathbf{A}^T\boldsymbol{\lambda}$ for some $\boldsymbol{\lambda}$. Using the fact that the projected gradient is $\mathbf{p} = \nabla f(\mathbf{x})^T + \mathbf{A}^T\boldsymbol{\mu}$ for some $\boldsymbol{\mu}$, we see that $\mathbf{d} = -\mathbf{p} + \mathbf{A}^T(\boldsymbol{\lambda} + \boldsymbol{\mu})$. Thus $\mathbf{d}$ can be decomposed into two components: one in the direction of the projected negative gradient, the other in a direction orthogonal to the tangent plane (see Fig. 15.1). Ideally, these two components should be in proper proportions so that the constraint surface is reached at the same point as would be reached by minimization in the direction of the projected negative gradient. If they are not, convergence will be poor.


Fig.  15.1  Decomposition  of  the  direction  d 
15.7 Primal-Dual Interior Point Methods

The  primal-dual  interior-point  methods  discussed  for  linear  programming  in  Chap.  5 
are,  as  mentioned  there,  closely  related  to  the  barrier  methods  presented  in  Chap.  13 
and  the  primal-dual  methods  of  the  current  chapter.  They  can  be  naturally  extended 
to  solve  nonlinear  programming  problems  while  maintaining  both  theoretical  and 
practical  efficiency. 

Consider  the  inequality  constrained  problem 

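$$\begin{array}{ll} \text{minimize} & f(\mathbf{x}) \\ \text{subject to} & \mathbf{A}\mathbf{x} = \mathbf{b}, \\ & \mathbf{g}(\mathbf{x}) \leq \mathbf{0}. \end{array} \qquad (15.48)$$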

In  general,  a  weakness  of  the  active  constraint  method  for  such  a  problem  is  the 
combinatorial  nature  of  determining  which  constraints  should  be  active. 


Logarithmic  Barrier  Function 

A method that avoids the necessity to explicitly select a set of active constraints is based on the logarithmic barrier method, which solves a sequence of equality constrained minimization problems. Specifically,

$$\begin{array}{ll} \text{minimize} & f(\mathbf{x}) - \mu\sum_{i=1}^{p}\log(-g_i(\mathbf{x})) \\ \text{subject to} & \mathbf{A}\mathbf{x} = \mathbf{b}, \end{array} \qquad (15.49)$$

where $\mu = \mu^k > 0$, $k = 1, \ldots$, $\mu^k > \mu^{k+1}$, $\mu^k \to 0$. The $\mu^k$s can be pre-determined. Typically, we have $\mu^{k+1} = \gamma\mu^k$ for some constant $0 < \gamma < 1$. Here, we also assume that the original problem has a feasible interior-point $\mathbf{x}^0$; that is,

$$\mathbf{A}\mathbf{x}^0 = \mathbf{b} \quad \text{and} \quad \mathbf{g}(\mathbf{x}^0) < \mathbf{0},$$

and $\mathbf{A}$ has full row rank.

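For fixed $\mu$, and using $s_i = -\mu/g_i(\mathbf{x})$, the first-order optimality conditions of the barrier problem (15.49) are:

$$\begin{aligned} -\mathbf{S}\mathbf{g}(\mathbf{x}) &= \mu\mathbf{1} \\ \mathbf{A}\mathbf{x} &= \mathbf{b} \\ -\mathbf{A}^T\mathbf{y} + \nabla f(\mathbf{x})^T + \nabla\mathbf{g}(\mathbf{x})^T\mathbf{s} &= \mathbf{0}, \end{aligned} \qquad (15.50)$$

where $\mathbf{S} = \operatorname{diag}(\mathbf{s})$; that is, a diagonal matrix whose diagonal entries are the components of $\mathbf{s}$, and $\nabla\mathbf{g}(\mathbf{x})$ is the Jacobian matrix of $\mathbf{g}(\mathbf{x})$.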
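A one-variable instance makes the conditions (15.50) concrete. The example is an assumption chosen for illustration (it is not from the text) and omits the equality constraint: minimize $f(x) = (x + 1)^2$ subject to $g(x) = -x \leq 0$, whose solution is $x^* = 0$ with multiplier $2$.

```latex
% Barrier problem: minimize (x + 1)^2 - \mu \log x over x > 0.
% Conditions (15.50) with g(x) = -x and no equality constraint:
\begin{aligned}
  s\,x &= \mu, \qquad 2(x + 1) - s = 0 \\
  \Longrightarrow\quad 2x^2 + 2x - \mu &= 0, \qquad
  x(\mu) = \frac{-1 + \sqrt{1 + 2\mu}}{2}, \qquad
  s(\mu) = 1 + \sqrt{1 + 2\mu}.
\end{aligned}
% As \mu \to 0: x(\mu) \to 0 = x^* and s(\mu) \to 2, the multiplier of x \ge 0.
```

The pair $(x(\mu), s(\mu))$ stays strictly positive for every $\mu > 0$ and reaches the solution and its multiplier only in the limit.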
If $f(\mathbf{x})$ and $g_i(\mathbf{x})$ are convex functions for all $i$, $f(\mathbf{x}) - \mu\sum_i\log(-g_i(\mathbf{x}))$ is strictly convex in the interior of the feasible region, and the objective level set is bounded, then there is a unique minimizer for the barrier problem. Let $(\mathbf{x}(\mu) > \mathbf{0}, \mathbf{y}(\mu), \mathbf{s}(\mu) > \mathbf{0})$ be the (unique) solution of (15.50). Then, these values form the primal-dual central path of (15.48):

$$\mathcal{C} = \{(\mathbf{x}(\mu), \mathbf{y}(\mu), \mathbf{s}(\mu) > \mathbf{0}) : 0 < \mu < \infty\}.$$

This  can  be  summarized  in  the  following  theorem. 

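Theorem 1. Let $(\mathbf{x}(\mu), \mathbf{y}(\mu), \mathbf{s}(\mu))$ be on the central path.

i) If $f(\mathbf{x})$ and $g_i(\mathbf{x})$ are convex functions for all $i$, then $\mathbf{s}(\mu)$ is unique.

ii) Furthermore, if $f(\mathbf{x}) - \mu\sum_i\log(-g_i(\mathbf{x}))$ is strictly convex, $(\mathbf{x}(\mu), \mathbf{y}(\mu), \mathbf{s}(\mu))$ are unique, and they are bounded for $0 < \mu \leq \mu^0$ for any given $\mu^0 > 0$.

iii) For $0 < \mu' < \mu$, $f(\mathbf{x}(\mu')) < f(\mathbf{x}(\mu))$ if $\mathbf{x}(\mu') \neq \mathbf{x}(\mu)$.

iv) $(\mathbf{x}(\mu), \mathbf{y}(\mu), \mathbf{s}(\mu))$ converges to a point satisfying the first-order necessary conditions for a solution of (15.48) as $\mu \to 0$.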


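Once we have an approximate solution point $(\mathbf{x}, \mathbf{y}, \mathbf{s}) = (\mathbf{x}_k, \mathbf{y}_k, \mathbf{s}_k)$ for (15.50) for $\mu = \mu^k > 0$, we can again use the primal-dual methods described for linear programming to generate a new approximate solution to (15.50) for $\mu = \mu^{k+1} < \mu^k$. The Newton direction $(\mathbf{d}_x, \mathbf{d}_y, \mathbf{d}_s)$ is found from the system of linear equations:

$$\begin{aligned} -\mathbf{S}\nabla\mathbf{g}(\mathbf{x})\mathbf{d}_x - \mathbf{G}(\mathbf{x})\mathbf{d}_s &= \mu\mathbf{1} + \mathbf{S}\mathbf{g}(\mathbf{x}), \\ \mathbf{A}\mathbf{d}_x &= \mathbf{b} - \mathbf{A}\mathbf{x}, \\ -\mathbf{A}^T\mathbf{d}_y + \Big(\nabla^2 f(\mathbf{x}) + \sum_i s_i\nabla^2 g_i(\mathbf{x})\Big)\mathbf{d}_x + \nabla\mathbf{g}(\mathbf{x})^T\mathbf{d}_s &= \mathbf{A}^T\mathbf{y} - \nabla f(\mathbf{x})^T - \nabla\mathbf{g}(\mathbf{x})^T\mathbf{s}, \end{aligned} \qquad (15.51)$$

where $\mathbf{G}(\mathbf{x}) = \operatorname{diag}(\mathbf{g}(\mathbf{x}))$. Then, the new iterate is updated to:

$$(\mathbf{x}_{k+1}, \mathbf{y}_{k+1}, \mathbf{s}_{k+1}) = (\mathbf{x}_k, \mathbf{y}_k, \mathbf{s}_k) + \alpha_k(\mathbf{d}_x, \mathbf{d}_y, \mathbf{d}_s)$$

for a stepsize $\alpha_k$. Recently, this approach has also been used to find points satisfying the first-order conditions for problems when $f(\mathbf{x})$ and $g_i(\mathbf{x})$ are not generally convex functions.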
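The iteration can be sketched on a scalar problem. The code below is a minimal illustration under stated assumptions, not the method as implemented in any library: the problem is the hypothetical instance minimize $(x + 1)^2$ subject to $-x \leq 0$, there is no equality constraint, $\mu$ is reduced by a fixed factor `gamma` each step, and the stepsize rule (keep a fraction 0.9 of the distance to the boundary) is a common safeguard chosen here for simplicity. With $g(x) = -x$, system (15.51) reduces to the two equations noted in the comments.

```python
def barrier_newton(x, s, mu, gamma=0.5, iters=60):
    """Follow the central path of: minimize (x + 1)^2 subject to -x <= 0.

    With g(x) = -x and no equality constraints, (15.50) reads
        s * x = mu,    2 * (x + 1) - s = 0.
    """
    for _ in range(iters):
        mu *= gamma                    # shrink the barrier parameter
        r1 = mu - s * x                # residual of s * x = mu
        r2 = s - 2.0 * (x + 1.0)       # residual of the stationarity condition
        # Newton system from (15.51): s*dx + x*ds = r1,  2*dx - ds = r2.
        dx = (r1 + x * r2) / (s + 2.0 * x)
        ds = 2.0 * dx - r2
        # Damp the step so that x and s stay strictly positive.
        alpha = 1.0
        if dx < 0.0:
            alpha = min(alpha, -0.9 * x / dx)
        if ds < 0.0:
            alpha = min(alpha, -0.9 * s / ds)
        x += alpha * dx
        s += alpha * ds
    return x, s, mu


x_end, s_end, mu_end = barrier_newton(x=1.0, s=1.0, mu=1.0)
print(x_end, s_end, mu_end)
```

The iterates approach the solution $x = 0$ and the multiplier $s = 2$ from the interior, without ever deciding whether the constraint is active.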


Interior  Point  Method  for  Convex  Quadratic  Programming 


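Let $f(\mathbf{x}) = (1/2)\mathbf{x}^T\mathbf{Q}\mathbf{x} + \mathbf{c}^T\mathbf{x}$ and $g_i(\mathbf{x}) = -x_i$ for $i = 1, \ldots, n$, and consider the quadratic program

$$\begin{array}{ll} \text{minimize} & \tfrac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} + \mathbf{c}^T\mathbf{x} \\ \text{subject to} & \mathbf{A}\mathbf{x} = \mathbf{b}, \\ & \mathbf{x} \geq \mathbf{0}, \end{array} \qquad (15.52)$$

where the given matrix $\mathbf{Q} \in E^{n \times n}$ is positive semidefinite (that is, the objective is a convex function), $\mathbf{A} \in E^{m \times n}$, $\mathbf{c} \in E^n$ and $\mathbf{b} \in E^m$. The problem reduces to finding $\mathbf{x} \in E^n$, $\mathbf{y} \in E^m$ and $\mathbf{s} \in E^n$ satisfying the following optimality conditions:

$$\begin{aligned} \mathbf{S}\mathbf{x} &= \mathbf{0} \\ \mathbf{A}\mathbf{x} &= \mathbf{b} \\ -\mathbf{A}^T\mathbf{y} + \mathbf{Q}\mathbf{x} - \mathbf{s} &= -\mathbf{c} \\ (\mathbf{x}, \mathbf{s}) &\geq \mathbf{0}. \end{aligned} \qquad (15.53)$$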

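The optimality conditions with the logarithmic barrier function with parameter $\mu$ are:

$$\begin{aligned} \mathbf{S}\mathbf{x} &= \mu\mathbf{1} \\ \mathbf{A}\mathbf{x} &= \mathbf{b} \\ -\mathbf{A}^T\mathbf{y} + \mathbf{Q}\mathbf{x} - \mathbf{s} &= -\mathbf{c}. \end{aligned} \qquad (15.54)$$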

Note  that  the  bottom  two  sets  of  constraints  are  linear  equalities. 

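Thus, once we have an interior feasible point $(\mathbf{x}, \mathbf{y}, \mathbf{s})$ for (15.54), with $\mu = \mathbf{x}^T\mathbf{s}/n$, we can apply Newton's method to compute a new (approximate) iterate $(\mathbf{x}^+, \mathbf{y}^+, \mathbf{s}^+)$ by solving for $(\mathbf{d}_x, \mathbf{d}_y, \mathbf{d}_s)$ from the system of linear equations:

$$\begin{aligned} \mathbf{S}\mathbf{d}_x + \mathbf{X}\mathbf{d}_s &= \gamma\mu\mathbf{1} - \mathbf{X}\mathbf{s}, \\ \mathbf{A}\mathbf{d}_x &= \mathbf{0}, \\ -\mathbf{A}^T\mathbf{d}_y + \mathbf{Q}\mathbf{d}_x - \mathbf{d}_s &= \mathbf{0}, \end{aligned} \qquad (15.55)$$

where $\mathbf{X}$ and $\mathbf{S}$ are two diagonal matrices whose diagonal entries are $\mathbf{x} > \mathbf{0}$ and $\mathbf{s} > \mathbf{0}$, respectively. Here, $\gamma$ is a fixed positive constant less than 1, which implies that our targeted $\mu$ is reduced by the factor $\gamma$ at each step.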
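Because the last block row of (15.55) determines $\mathbf{d}_s$ explicitly, the system can be reduced before it is solved. The elimination below is a routine manipulation of (15.55), offered as a sketch of one way to organize the computation rather than as a step taken from the text.

```latex
% Substitute d_s = Q d_x - A^T d_y into the first block row of (15.55):
\begin{aligned}
  \mathbf{d}_s &= \mathbf{Q}\mathbf{d}_x - \mathbf{A}^T\mathbf{d}_y, \\
  (\mathbf{S} + \mathbf{X}\mathbf{Q})\mathbf{d}_x - \mathbf{X}\mathbf{A}^T\mathbf{d}_y
      &= \gamma\mu\mathbf{1} - \mathbf{X}\mathbf{s}, \\
  \mathbf{A}\mathbf{d}_x &= \mathbf{0}.
\end{aligned}
% Multiplying the middle equation by X^{-1} (X is diagonal with positive entries):
(\mathbf{Q} + \mathbf{X}^{-1}\mathbf{S})\mathbf{d}_x - \mathbf{A}^T\mathbf{d}_y
    = \gamma\mu\mathbf{X}^{-1}\mathbf{1} - \mathbf{s},
\qquad \mathbf{A}\mathbf{d}_x = \mathbf{0}.
```

The block $\mathbf{Q} + \mathbf{X}^{-1}\mathbf{S}$ is symmetric and positive definite whenever $\mathbf{Q}$ is positive semidefinite and $\mathbf{x}, \mathbf{s} > \mathbf{0}$, so the reduced system has the same saddle-point structure as the equality constrained quadratic programs treated earlier in the chapter.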


Potential  Function  as  a  Merit  Function 

For any interior feasible point $(\mathbf{x}, \mathbf{y}, \mathbf{s})$ of (15.52) and its dual, a suitable merit function is the potential function introduced in Chap. 5 for linear programming:

$$\psi_{n+\rho}(\mathbf{x}, \mathbf{s}) = (n + \rho)\log(\mathbf{x}^T\mathbf{s}) - \sum_{j=1}^{n}\log(x_j s_j).$$

The main result for this is stated in the following theorem.

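Theorem 2. In solving (15.55) for $(\mathbf{d}_x, \mathbf{d}_y, \mathbf{d}_s)$, let $\gamma = n/(n + \rho) < 1$ for fixed $\rho \geq \sqrt{n}$ and assign $\mathbf{x}^+ = \mathbf{x} + \bar{\alpha}\mathbf{d}_x$, $\mathbf{y}^+ = \mathbf{y} + \bar{\alpha}\mathbf{d}_y$, and $\mathbf{s}^+ = \mathbf{s} + \bar{\alpha}\mathbf{d}_s$ where

$$\bar{\alpha} = \frac{\alpha\sqrt{\min(\mathbf{X}\mathbf{s})}}{\left|(\mathbf{X}\mathbf{S})^{-1/2}\left(\frac{\mathbf{x}^T\mathbf{s}}{n+\rho}\mathbf{1} - \mathbf{X}\mathbf{s}\right)\right|},$$

where $\alpha$ is any positive constant less than 1. (Again $\mathbf{X}$ and $\mathbf{S}$ are matrices with components on the diagonal being those of $\mathbf{x}$ and $\mathbf{s}$, respectively.) Then,

$$\psi_{n+\rho}(\mathbf{x}^+, \mathbf{s}^+) - \psi_{n+\rho}(\mathbf{x}, \mathbf{s}) \leq -\alpha\sqrt{3/4} + \frac{\alpha^2}{2(1 - \alpha)}.$$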

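The proof of the theorem is also similar to that for linear programming; see Exercise 12. Notice that, since $\mathbf{Q}$ is positive semidefinite, we have

$$\mathbf{d}_x^T\mathbf{d}_s = (\mathbf{d}_x, \mathbf{d}_y)^T(\mathbf{d}_s, \mathbf{0}) = \mathbf{d}_x^T\mathbf{Q}\mathbf{d}_x \geq 0$$

while $\mathbf{d}_x^T\mathbf{d}_s = 0$ in the linear programming case.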

We  outline  the  algorithm  here: 

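Given any interior feasible $(\mathbf{x}_0, \mathbf{y}_0, \mathbf{s}_0)$ of (15.52) and its dual. Set $\rho \geq \sqrt{n}$ and $k = 0$.

1. Set $(\mathbf{x}, \mathbf{s}) = (\mathbf{x}_k, \mathbf{s}_k)$ and $\gamma = n/(n + \rho)$ and compute $(\mathbf{d}_x, \mathbf{d}_y, \mathbf{d}_s)$ from (15.55).

2. Let $\mathbf{x}_{k+1} = \mathbf{x}_k + \bar{\alpha}\mathbf{d}_x$, $\mathbf{y}_{k+1} = \mathbf{y}_k + \bar{\alpha}\mathbf{d}_y$, and $\mathbf{s}_{k+1} = \mathbf{s}_k + \bar{\alpha}\mathbf{d}_s$ where

$$\bar{\alpha} = \arg\min_{\alpha \geq 0}\ \psi_{n+\rho}(\mathbf{x}_k + \alpha\mathbf{d}_x, \mathbf{s}_k + \alpha\mathbf{d}_s).$$

3. Let $k = k + 1$. If $\mathbf{s}_k^T\mathbf{x}_k/\mathbf{s}_0^T\mathbf{x}_0 \leq \varepsilon$, stop. Otherwise, return to Step 1.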

This  algorithm  exhibits  an  iteration  complexity  bound  that  is  identical  to  that  of 
linear  programming  expressed  in  Theorem  1,  Sect.  5.6. 
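The outline can be turned into a short program. The sketch below is an illustration under several assumptions that are not from the text: it uses only the Python standard library with a hand-written Gaussian elimination (`solve`), it replaces the exact line search of Step 2 by the fixed stepsize of Theorem 2 with $\alpha = 0.5$, and it is run on a hypothetical instance with $\mathbf{Q} = 2\mathbf{I}$, $\mathbf{c} = (-2, -5)$, $\mathbf{A} = [1, 1]$, $\mathbf{b} = 2$, whose solution is $\mathbf{x} = (0.25, 1.75)$ with $y = -1.5$. The starting point $\mathbf{x}_0 = (1, 1)$, $y_0 = -10$, $\mathbf{s}_0 = \mathbf{Q}\mathbf{x}_0 + \mathbf{c} - \mathbf{A}^T y_0 = (10, 7)$ is interior feasible.

```python
import math


def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def newton_direction(Q, A, x, s, gamma):
    """Assemble and solve (15.55); unknowns are ordered as (dx, dy, ds)."""
    n, m = len(x), len(A)
    mu = sum(xi * si for xi, si in zip(x, s)) / n
    size = 2 * n + m
    M = [[0.0] * size for _ in range(size)]
    rhs = [0.0] * size
    for i in range(n):                    # S dx + X ds = gamma*mu*1 - X s
        M[i][i] = s[i]
        M[i][n + m + i] = x[i]
        rhs[i] = gamma * mu - x[i] * s[i]
    for r in range(m):                    # A dx = 0
        for j in range(n):
            M[n + r][j] = A[r][j]
    for i in range(n):                    # -A^T dy + Q dx - ds = 0
        for j in range(n):
            M[n + m + i][j] = Q[i][j]
        for r in range(m):
            M[n + m + i][n + r] = -A[r][i]
        M[n + m + i][n + m + i] = -1.0
    z = solve(M, rhs)
    return z[:n], z[n:n + m], z[n + m:]


def potential_reduction(Q, A, x, y, s, rho, eps, alpha=0.5, max_iter=1000):
    """Iterate until s_k^T x_k / s_0^T x_0 <= eps, using the step of Theorem 2."""
    n = len(x)
    gamma = n / (n + rho)
    gap0 = sum(xi * si for xi, si in zip(x, s))
    for k in range(max_iter):
        gap = sum(xi * si for xi, si in zip(x, s))
        if gap / gap0 <= eps:
            return x, y, s, k
        dx, dy, ds = newton_direction(Q, A, x, s, gamma)
        xs = [xi * si for xi, si in zip(x, s)]
        scaled = [(gap / (n + rho) - v) / math.sqrt(v) for v in xs]
        step = alpha * math.sqrt(min(xs)) / math.sqrt(sum(r * r for r in scaled))
        x = [xi + step * d for xi, d in zip(x, dx)]
        y = [yi + step * d for yi, d in zip(y, dy)]
        s = [si + step * d for si, d in zip(s, ds)]
    return x, y, s, max_iter


Q = [[2.0, 0.0], [0.0, 2.0]]
A = [[1.0, 1.0]]
x, y, s, iters = potential_reduction(Q, A, [1.0, 1.0], [-10.0], [10.0, 7.0],
                                     rho=2.0, eps=1e-8)
print(iters, x, y, s)
```

Neither $\mathbf{c}$ nor $\mathbf{b}$ appears in the code: because the starting point satisfies the linear equations of (15.54) and every direction satisfies the homogeneous equations of (15.55), all iterates remain feasible automatically.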


15.8  Summary 

A constrained optimization problem can be solved by directly solving the equations that represent the first-order necessary conditions for a solution. For a quadratic programming problem with linear constraints, these equations are linear and thus can be solved by standard linear procedures. Quadratic programs with inequality constraints can be solved by an active set method in which the direction of movement is toward the solution of the corresponding equality constrained problem. This method will solve a quadratic program in a finite number of steps.

For general nonlinear programming problems, many of the standard methods for solving systems of equations can be adapted to the corresponding necessary equations. One class consists of first-order methods that move in a direction related to the residual (that is, the error) in the equations. Another class of methods is based on extending the method of conjugate directions to nonpositive-definite systems. Finally, a third class is based on Newton's method for solving systems of nonlinear equations, and solving a linearized version of the system at each iteration. Under appropriate assumptions, Newton's method has excellent global as well as local convergence properties, since the simple merit function, $\frac{1}{2}|\nabla f(\mathbf{x}) + \boldsymbol{\lambda}^T\nabla\mathbf{h}(\mathbf{x})|^2 + \frac{1}{2}|\mathbf{h}(\mathbf{x})|^2$, decreases in the Newton direction. An individual step of Newton's method is equivalent to solving a quadratic programming problem, and thus Newton's method can be extended to problems with inequality constraints through recursive quadratic programming.

More effective methods are developed by accounting for the special structure of the linearized version of the necessary conditions and by introducing approximations to the second-order information. In order to assure global convergence of these methods, a penalty (or merit) function must be specified that is compatible with the method of direction selection, in the sense that the direction is a direction of descent for the merit function. The absolute-value penalty function and the standard quadratic penalty function are both compatible with some versions of recursive quadratic programming.

The  best  of  the  primal-dual  methods  take  full  account  of  special  structure,  and  are 
based  on  direction-finding  procedures  that  are  closely  related  to  methods  described 
in  earlier  chapters.  It  is  not  surprising  therefore  that  the  convergence  properties  of 
these  methods  are  also  closely  related  to  those  of  other  chapters.  Again  we  find  that 
the  canonical  rate  is  fundamental  for  properly  designed  first-order  methods. 

Interior point methods in the primal-dual model are very effective for treating problems with inequality constraints, for they avoid (or at least minimize) the difficulties associated with determining which constraints will be active at the solution. Applied to general nonlinear programming problems, these methods closely parallel the interior point methods for linear programming. There is again a central path, and Newton's method is a good way to follow the path.


15.9  Exercises 

1. Solve the quadratic program

$$\begin{array}{ll} \text{minimize} & x^2 - xy + y^2 - 3x \\ \text{subject to} & x \geq 0 \\ & y \geq 0 \\ & x + y \leq 4 \end{array}$$

by use of the active set method starting at $x = y = 0$.

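2. Suppose $\mathbf{x}^*$, $\boldsymbol{\lambda}^*$ satisfy

$$\begin{aligned} \nabla f(\mathbf{x}^*) + \boldsymbol{\lambda}^{*T}\nabla\mathbf{h}(\mathbf{x}^*) &= \mathbf{0} \\ \mathbf{h}(\mathbf{x}^*) &= \mathbf{0}. \end{aligned}$$

Let

$$\mathbf{C} = \begin{bmatrix} \mathbf{L}(\mathbf{x}^*, \boldsymbol{\lambda}^*) & \nabla\mathbf{h}(\mathbf{x}^*)^T \\ -\nabla\mathbf{h}(\mathbf{x}^*) & \mathbf{0} \end{bmatrix}.$$

Assume that $\mathbf{L}(\mathbf{x}^*, \boldsymbol{\lambda}^*)$ is positive definite and that $\nabla\mathbf{h}(\mathbf{x}^*)$ is of full rank.

(a) Show that the real part of each eigenvalue of $\mathbf{C}$ is positive.

(b) Using the result of Part (a), show that for some $\alpha > 0$ the iterative process

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha\nabla l(\mathbf{x}_k, \boldsymbol{\lambda}_k)^T \quad \text{and} \quad \boldsymbol{\lambda}_{k+1} = \boldsymbol{\lambda}_k + \alpha\mathbf{h}(\mathbf{x}_k)$$

converges locally to $\mathbf{x}^*$, $\boldsymbol{\lambda}^*$. (That is, if started sufficiently close to $\mathbf{x}^*$, $\boldsymbol{\lambda}^*$, the process converges to $\mathbf{x}^*$, $\boldsymbol{\lambda}^*$.) Hint: Use Ostrowski's Theorem: Let $\mathbf{A}(\mathbf{z})$ be a continuously differentiable mapping from $E^p$ to $E^p$, assume $\mathbf{A}(\mathbf{z}^*) = \mathbf{0}$, and let $\mathbf{I} + \nabla\mathbf{A}(\mathbf{z}^*)$ have all eigenvalues strictly inside the unit circle of the complex plane. Then $\mathbf{z}_{k+1} = \mathbf{z}_k + \mathbf{A}(\mathbf{z}_k)$ converges locally to $\mathbf{z}^*$.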

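3. Let $\mathbf{A}$ be a real symmetric matrix. A vector $\mathbf{x}$ is singular if $\mathbf{x}^T\mathbf{A}\mathbf{x} = 0$. A pair of vectors $\mathbf{x}$, $\mathbf{y}$ is a hyperbolic pair if both $\mathbf{x}$ and $\mathbf{y}$ are singular and $\mathbf{x}^T\mathbf{A}\mathbf{y} \neq 0$. Hyperbolic pairs can be used to generalize the conjugate gradient method to the nonpositive definite case.

(a) If $\mathbf{p}_k$ is singular, show that if $\mathbf{p}_{k+1}$ is defined as

$$\mathbf{p}_{k+1} = \mathbf{A}\mathbf{p}_k - \frac{(\mathbf{A}\mathbf{p}_k)^T\mathbf{A}^2\mathbf{p}_k}{2|\mathbf{A}\mathbf{p}_k|^2}\,\mathbf{p}_k,$$

then $\mathbf{p}_k$, $\mathbf{p}_{k+1}$ is a hyperbolic pair.

(b) Consider a modification of the conjugate gradient process of Sect. 8.1, where if $\mathbf{p}_k$ is singular, $\mathbf{p}_{k+1}$ is generated as above, and then

$$\begin{aligned} \mathbf{x}_{k+1} &= \mathbf{x}_k + \alpha_k\mathbf{p}_k \\ \mathbf{x}_{k+2} &= \mathbf{x}_{k+1} + \alpha_{k+1}\mathbf{p}_{k+1} \\ \alpha_k &= \frac{\mathbf{r}_k^T\mathbf{p}_{k+1}}{\mathbf{p}_k^T\mathbf{A}\mathbf{p}_{k+1}}, \qquad \alpha_{k+1} = \frac{\mathbf{r}_k^T\mathbf{p}_k}{\mathbf{p}_k^T\mathbf{A}\mathbf{p}_{k+1}} \\ \mathbf{p}_{k+2} &= \mathbf{r}_{k+2} - \frac{\mathbf{r}_{k+2}^T\mathbf{A}\mathbf{p}_{k+1}}{\mathbf{p}_k^T\mathbf{A}\mathbf{p}_{k+1}}\,\mathbf{p}_k. \end{aligned}$$

Show that if $\mathbf{p}_{k+1}$ is the second member of a hyperbolic pair and $\mathbf{r}_k \neq \mathbf{0}$, then $\mathbf{x}_{k+2} \neq \mathbf{x}_{k+1}$, which means the process does not get "stuck."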

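4. Another method for solving a system $\mathbf{A}\mathbf{x} = \mathbf{b}$ when $\mathbf{A}$ is nonsingular and symmetric is the conjugate residual method. In this method the direction vectors are constructed to be an $\mathbf{A}^2$-orthogonalized version of the residuals $\mathbf{r}_k = \mathbf{b} - \mathbf{A}\mathbf{x}_k$. The error function $E(\mathbf{x}) = |\mathbf{A}\mathbf{x} - \mathbf{b}|^2$ decreases monotonically in this process. Since the directions are based on $\mathbf{r}_k$ rather than the gradient of $E$, which is $-2\mathbf{A}\mathbf{r}_k$, the method extends the simplicity of the conjugate gradient method by implicit use of the fact that $\mathbf{A}^2$ is positive definite. The method is this: Set $\mathbf{p}_1 = \mathbf{r}_1 = \mathbf{b} - \mathbf{A}\mathbf{x}_1$ and repeat the following steps, omitting (a, b) on the first step.

If $\alpha_{k-1} \neq 0$,

$$\mathbf{p}_k = \mathbf{r}_k - \beta_k\mathbf{p}_{k-1}, \qquad \beta_k = \frac{\mathbf{r}_k^T\mathbf{A}^2\mathbf{p}_{k-1}}{\mathbf{p}_{k-1}^T\mathbf{A}^2\mathbf{p}_{k-1}}. \qquad (15.56a)$$

If $\alpha_{k-1} = 0$,

$$\mathbf{p}_k = \mathbf{A}\mathbf{r}_k - \gamma_k\mathbf{p}_{k-1} - \delta_k\mathbf{p}_{k-2}, \qquad \gamma_k = \frac{\mathbf{r}_k^T\mathbf{A}^3\mathbf{p}_{k-1}}{\mathbf{p}_{k-1}^T\mathbf{A}^2\mathbf{p}_{k-1}}, \qquad \delta_k = \frac{\mathbf{r}_k^T\mathbf{A}^3\mathbf{p}_{k-2}}{\mathbf{p}_{k-2}^T\mathbf{A}^2\mathbf{p}_{k-2}}. \qquad (15.56b)$$

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\mathbf{p}_k, \qquad \alpha_k = \frac{\mathbf{r}_k^T\mathbf{A}\mathbf{p}_k}{\mathbf{p}_k^T\mathbf{A}^2\mathbf{p}_k}. \qquad (15.56c)$$

$$\mathbf{r}_{k+1} = \mathbf{b} - \mathbf{A}\mathbf{x}_{k+1}. \qquad (15.56d)$$

Show that the directions $\mathbf{p}_k$ are $\mathbf{A}^2$-orthogonal.

5. Consider the $(n + m)$-dimensional system of equations

$$\begin{bmatrix} \mathbf{L} & \mathbf{A}^T \\ \mathbf{A} & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{x} \\ \boldsymbol{\lambda} \end{bmatrix} = \begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix}.$$

Suppose that $\mathbf{A} = [\mathbf{B}, \mathbf{C}]$, where $\mathbf{B}$ is $m \times m$ and invertible. Let $\mathbf{x} = (\mathbf{x}_B, \mathbf{x}_C)$, where $\mathbf{x}_B$ is the first $m$ components of $\mathbf{x}$. The system can then be written

$$\begin{bmatrix} \mathbf{L}_{BB} & \mathbf{L}_{BC} & \mathbf{B}^T \\ \mathbf{L}_{CB} & \mathbf{L}_{CC} & \mathbf{C}^T \\ \mathbf{B} & \mathbf{C} & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{x}_B \\ \mathbf{x}_C \\ \boldsymbol{\lambda} \end{bmatrix} = \begin{bmatrix} \mathbf{a}_B \\ \mathbf{a}_C \\ \mathbf{b} \end{bmatrix}.$$

(a) Assume that $\mathbf{L}$ is positive definite on the tangent space $\{\mathbf{x} : \mathbf{A}\mathbf{x} = \mathbf{0}\}$. Derive an explicit statement equivalent to this assumption in terms of the positive definiteness of some $(n - m) \times (n - m)$ matrix.

(b) Solve the system in terms of the submatrices of the partitioned form.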

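6. Consider the partitioned square matrix $\mathbf{M}$ of the form

$$\mathbf{M} = \begin{bmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \end{bmatrix}.$$

Show that

$$\mathbf{M}^{-1} = \begin{bmatrix} \mathbf{Q} & -\mathbf{Q}\mathbf{B}\mathbf{D}^{-1} \\ -\mathbf{D}^{-1}\mathbf{C}\mathbf{Q} & \mathbf{D}^{-1} + \mathbf{D}^{-1}\mathbf{C}\mathbf{Q}\mathbf{B}\mathbf{D}^{-1} \end{bmatrix},$$

where $\mathbf{Q} = (\mathbf{A} - \mathbf{B}\mathbf{D}^{-1}\mathbf{C})^{-1}$, provided that all indicated inverses exist. Use this result to verify the rate of convergence result in Sect. 15.6.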

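7. For the problem

$$\begin{array}{ll} \text{minimize} & f(\mathbf{x}) \\ \text{subject to} & \mathbf{g}(\mathbf{x}) \leq \mathbf{0}, \end{array}$$

where $\mathbf{g}(\mathbf{x})$ is $r$-dimensional, define the penalty function

$$p(\mathbf{x}) = f(\mathbf{x}) + c\max\{0, g_1(\mathbf{x}), g_2(\mathbf{x}), \ldots, g_r(\mathbf{x})\}.$$

Let $\mathbf{d}$ ($\mathbf{d} \neq \mathbf{0}$) be a solution to the quadratic program

$$\begin{array}{ll} \text{minimize} & \tfrac{1}{2}\mathbf{d}^T\mathbf{B}\mathbf{d} + \nabla f(\mathbf{x})\mathbf{d} \\ \text{subject to} & \mathbf{g}(\mathbf{x}) + \nabla\mathbf{g}(\mathbf{x})\mathbf{d} \leq \mathbf{0}, \end{array}$$

where $\mathbf{B}$ is positive definite. Show that $\mathbf{d}$ is a descent direction for $p$ for sufficiently large $c$.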

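8. Suppose the quadratic program of Exercise 7 is not feasible. In that case one may solve

$$\begin{array}{ll} \text{minimize} & \tfrac{1}{2}\mathbf{d}^T\mathbf{B}\mathbf{d} + \nabla f(\mathbf{x})\mathbf{d} + c\xi \\ \text{subject to} & \mathbf{g}(\mathbf{x}) + \nabla\mathbf{g}(\mathbf{x})\mathbf{d} \leq \xi\mathbf{1}. \end{array}$$

(a) Show that if $\mathbf{d} \neq \mathbf{0}$ is a solution, then $\mathbf{d}$ is a descent direction for $p$.

(b) If $\mathbf{d} = \mathbf{0}$ is a solution, show that $\mathbf{x}$ is a critical point of $p$ in the sense that for any $\mathbf{d} \neq \mathbf{0}$, $p(\mathbf{x} + \alpha\mathbf{d}) \geq p(\mathbf{x}) + o(\alpha)$.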

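9. For the equality constrained problem, consider the function

$$\phi(\mathbf{x}) = f(\mathbf{x}) + \boldsymbol{\lambda}(\mathbf{x})^T\mathbf{h}(\mathbf{x}) + c\,\mathbf{h}(\mathbf{x})^T\mathbf{C}(\mathbf{x})\mathbf{C}(\mathbf{x})^T\mathbf{h}(\mathbf{x}),$$

where

$$\mathbf{C}(\mathbf{x}) = [\nabla\mathbf{h}(\mathbf{x})\nabla\mathbf{h}(\mathbf{x})^T]^{-1}\nabla\mathbf{h}(\mathbf{x}) \quad \text{and} \quad \boldsymbol{\lambda}(\mathbf{x}) = \mathbf{C}(\mathbf{x})\nabla f(\mathbf{x})^T.$$

(a) Under standard assumptions on the original problem, show that for sufficiently large $c$, $\phi$ is (locally) an exact penalty function.

(b) Show that $\phi(\mathbf{x})$ can be expressed as

$$\phi(\mathbf{x}) = f(\mathbf{x}) + \boldsymbol{\pi}(\mathbf{x})^T\mathbf{h}(\mathbf{x}),$$

where $\boldsymbol{\pi}(\mathbf{x})$ is the Lagrange multiplier of the problem

$$\begin{array}{ll} \text{minimize} & \tfrac{1}{2}c\,\mathbf{d}^T\mathbf{d} + \nabla f(\mathbf{x})\mathbf{d} \\ \text{subject to} & \nabla\mathbf{h}(\mathbf{x})\mathbf{d} + \mathbf{h}(\mathbf{x}) = \mathbf{0}. \end{array}$$

(c) Indicate how $\phi$ can be defined for problems with inequality constraints.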

10. Let $\{\mathbf{B}_k\}$ be a sequence of positive definite symmetric matrices, and assume that there are constants $a > 0$, $b > 0$ such that $a|\mathbf{x}|^2 \leq \mathbf{x}^T\mathbf{B}_k\mathbf{x} \leq b|\mathbf{x}|^2$ for all $\mathbf{x}$. Suppose that $\mathbf{B}$ is replaced by $\mathbf{B}_k$ in the $k$th step of the recursive quadratic programming procedure of the theorem in Sect. 15.4. Show that the conclusions of that theorem are still valid. Hint: Note that the set of allowable $\mathbf{B}_k$'s is closed.

11. (Central path theorem) Prove the central path theorem, Theorem 1 of Sect. 15.7, for convex optimization.

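12. Prove the potential reduction theorem, Theorem 2 of Sect. 15.7, for convex quadratic programming. This theorem can be generalized to non-quadratic convex objective functions $f(\mathbf{x})$ satisfying the following condition: let

$$u : (0, 1) \to (1, \infty)$$

be a monotone increasing function; then

$$|\mathbf{X}(\nabla f(\mathbf{x} + \mathbf{d}_x) - \nabla f(\mathbf{x}) - \nabla^2 f(\mathbf{x})\mathbf{d}_x)|_1 \leq u(\alpha)\,\mathbf{d}_x^T\nabla^2 f(\mathbf{x})\mathbf{d}_x$$

whenever

$$\mathbf{x} > \mathbf{0}, \qquad |\mathbf{X}^{-1}\mathbf{d}_x|_\infty \leq \alpha < 1.$$

Such a condition is called the scaled Lipschitz condition in $\{\mathbf{x} : \mathbf{x} > \mathbf{0}\}$.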
Appendix A

Mathematical  Review 


The  purpose  of  this  appendix  is  to  set  down  for  reference  and  review  some  basic 
definitions,  notation,  and  relations  that  are  used  frequently  in  the  text. 


A.1 Sets

If $x$ is a member of the set $S$, we write $x \in S$. We write $y \notin S$ if $y$ is not a member of $S$.

A set $S$ may be specified by listing its elements between braces; such as, for example, $S = \{1, 2, 3, 4\}$. Alternatively, a set can be specified in the form $S = \{x : P(x)\}$ as the set of elements satisfying property $P$; such as $S = \{x : 1 \leq x \leq 4,\ x \text{ integer}\}$.

The union of two sets $S$ and $T$ is denoted $S \cup T$ and is the set consisting of the elements that belong to either $S$ or $T$. The intersection of two sets $S$ and $T$ is denoted $S \cap T$ and is the set consisting of the elements that belong to both $S$ and $T$. If $S$ is a subset of $T$, that is, if every member of $S$ is also a member of $T$, we write $S \subset T$ or $T \supset S$.

The empty set is denoted $\phi$ or $\emptyset$. There are two ways that operations such as minimization over a set are represented. Specifically we write either

$$\min_{x \in S} f(x) \quad \text{or} \quad \min\{f(x) : x \in S\}$$

to denote the minimum value of $f$ over the set $S$. The set of $x$'s in $S$ that achieve the minimum is denoted $\arg\min\{f(x) : x \in S\}$.


©  Springer  International  Publishing  Switzerland  2016  495 

D.G.  Luenberger,  Y.  Ye,  Linear  and  Nonlinear  Programming ,  International 
Series  in  Operations  Research  &  Management  Science  228, 

DOI  10.1007/978-3-319-18842-3 




Sets  of  Real  Numbers 

If $a$ and $b$ are real numbers, $[a, b]$ denotes the set of real numbers $x$ satisfying $a \leq x \leq b$. A rounded, instead of square, bracket denotes strict inequality in the definition. Thus $(a, b]$ denotes all $x$ satisfying $a < x \leq b$.

If $S$ is a set of real numbers bounded above, then there is a smallest real number $y$ such that $x \leq y$ for all $x \in S$. The number $y$ is called the least upper bound or supremum of $S$ and is denoted

$$\sup_{x \in S}(x) \quad \text{or} \quad \sup\{x : x \in S\}.$$

Similarly, the greatest lower bound or infimum of a set $S$ is denoted

$$\inf_{x \in S}(x) \quad \text{or} \quad \inf\{x : x \in S\}.$$


A.2  Matrix  Notation 


A  matrix  is  a  rectangular  array  of  numbers,  called  elements.  The  matrix  itself  is 
denoted  by  a  boldface  letter.  When  specific  numbers  are  not  used,  the  elements  are 
denoted  by  italicized  lower-case  letters,  having  a  double  subscript.  Thus  we  write 


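$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$$

for a matrix $\mathbf{A}$ having $m$ rows and $n$ columns. Such a matrix is referred to as an $m \times n$ matrix. If we wish to specify a matrix by defining a general element, we use the notation $\mathbf{A} = [a_{ij}]$.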

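An $m \times n$ matrix all of whose elements are zero is called a zero matrix and denoted $\mathbf{0}$. A square matrix (a matrix with $m = n$) whose elements are $a_{ij} = 0$ for $i \neq j$, and $a_{ii} = 1$ for $i = 1, 2, \ldots, n$ is said to be an identity matrix and denoted $\mathbf{I}$.

The sum of two $m \times n$ matrices $\mathbf{A}$ and $\mathbf{B}$ is written $\mathbf{A} + \mathbf{B}$ and is the matrix whose elements are the sum of the corresponding elements in $\mathbf{A}$ and $\mathbf{B}$. The product of a matrix $\mathbf{A}$ and a scalar $\lambda$, written $\lambda\mathbf{A}$ or $\mathbf{A}\lambda$, is obtained by multiplying each element of $\mathbf{A}$ by $\lambda$. The product $\mathbf{A}\mathbf{B}$ of an $m \times n$ matrix $\mathbf{A}$ and an $n \times p$ matrix $\mathbf{B}$ is the $m \times p$ matrix $\mathbf{C}$ with elements $c_{ij} = \sum_{k=1}^{n} a_{ik}b_{kj}$.

The transpose of an $m \times n$ matrix $\mathbf{A}$ is the $n \times m$ matrix $\mathbf{A}^T$ with elements $a_{ij}^T = a_{ji}$.

A (square) matrix $\mathbf{A}$ is symmetric if $\mathbf{A}^T = \mathbf{A}$. A square matrix $\mathbf{A}$ is nonsingular if there is a matrix $\mathbf{A}^{-1}$, called the inverse of $\mathbf{A}$, such that $\mathbf{A}^{-1}\mathbf{A} = \mathbf{I} = \mathbf{A}\mathbf{A}^{-1}$. The determinant of a square matrix $\mathbf{A}$ is denoted by $\det(\mathbf{A})$. The determinant is nonzero if and only if the matrix is nonsingular. Two square $n \times n$ matrices $\mathbf{A}$ and $\mathbf{B}$ are similar if there is a nonsingular matrix $\mathbf{S}$ such that $\mathbf{B} = \mathbf{S}^{-1}\mathbf{A}\mathbf{S}$.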
Matrices having a single row are referred to as row vectors; matrices having a single column are referred to as column vectors. Vectors of either type are usually denoted by lower-case boldface letters. To economize page space, row vectors are written $\mathbf{a} = [a_1, a_2, \ldots, a_n]$ and column vectors are written $\mathbf{a} = (a_1, a_2, \ldots, a_n)$. Since column vectors are used frequently, this notation avoids the necessity to display numerous columns. To further distinguish rows from columns, we write $\mathbf{a} \in E^n$ if $\mathbf{a}$ is a column vector with $n$ components, and we write $\mathbf{b} \in E_n$ if $\mathbf{b}$ is a row vector with $n$ components.

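It is often convenient to partition a matrix into submatrices. This is indicated by drawing partitioning lines through the matrix, as for example,

$$\mathbf{A} = \left[\begin{array}{cc|cc} a_{11} & a_{12} & a_{13} & a_{14} \\ \hline a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \end{array}\right] = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}.$$

The resulting submatrices are usually denoted $\mathbf{A}_{ij}$, as illustrated.

A matrix can be partitioned into either column or row vectors, in which case a special notation is convenient. Denoting the columns of an $m \times n$ matrix $\mathbf{A}$ by $\mathbf{a}_j$, $j = 1, 2, \ldots, n$, we write $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n]$. Similarly, denoting the rows of $\mathbf{A}$ by $\mathbf{a}^i$, $i = 1, 2, \ldots, m$, we write $\mathbf{A} = (\mathbf{a}^1, \mathbf{a}^2, \ldots, \mathbf{a}^m)$. Following the same pattern, we often write $\mathbf{A} = [\mathbf{B}, \mathbf{C}]$ for the partitioned matrix $\mathbf{A} = [\mathbf{B} \,|\, \mathbf{C}]$.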


A.3  Spaces 

We consider the $n$-component vectors $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ as elements of a vector space. The space itself, $n$-dimensional Euclidean space, is denoted $E^n$. Vectors in the space can be added or multiplied by a scalar, by performing the corresponding operations on the components. We write $\mathbf{x} \geq \mathbf{0}$ if each component of $\mathbf{x}$ is nonnegative.

The line segment connecting two vectors $\mathbf{x}$ and $\mathbf{y}$ is denoted $[\mathbf{x}, \mathbf{y}]$ and consists of all vectors of the form $\alpha\mathbf{x} + (1 - \alpha)\mathbf{y}$ with $0 \leq \alpha \leq 1$.

The scalar product of two vectors $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and $\mathbf{y} = (y_1, y_2, \ldots, y_n)$ is defined as $\mathbf{x}^T\mathbf{y} = \mathbf{y}^T\mathbf{x} = \sum_{i=1}^{n} x_i y_i$. The vectors $\mathbf{x}$ and $\mathbf{y}$ are said to be orthogonal if $\mathbf{x}^T\mathbf{y} = 0$. The magnitude or norm of a vector $\mathbf{x}$ is $|\mathbf{x}| = (\mathbf{x}^T\mathbf{x})^{1/2}$. For any two vectors $\mathbf{x}$ and $\mathbf{y}$ in $E^n$, the Cauchy-Schwarz Inequality holds: $|\mathbf{x}^T\mathbf{y}| \leq |\mathbf{x}| \cdot |\mathbf{y}|$.

A set of vectors $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_k$ is said to be linearly dependent if there are scalars $\lambda_1, \lambda_2, \ldots, \lambda_k$, not all zero, such that $\sum_{i=1}^{k}\lambda_i\mathbf{a}_i = \mathbf{0}$. If no such set of scalars exists, the vectors are said to be linearly independent. A linear combination of the vectors $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_k$ is a vector of the form $\sum_{i=1}^{k}\lambda_i\mathbf{a}_i$. The set of vectors that are linear combinations of $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_k$ is the set spanned by the vectors. A linearly independent set of vectors that span $E^n$ is said to be a basis for $E^n$. Every basis for $E^n$ contains exactly $n$ vectors.
The rank of a matrix $\mathbf{A}$ is equal to the maximum number of linearly independent columns in $\mathbf{A}$. This number is also equal to the maximum number of linearly independent rows in $\mathbf{A}$. The $m \times n$ matrix $\mathbf{A}$ is said to be of full rank if the rank of $\mathbf{A}$ is equal to the minimum of $m$ and $n$.

A subspace $M$ of $E^n$ is a subset that is closed under the operations of vector addition and scalar multiplication; that is, if $\mathbf{a}$ and $\mathbf{b}$ are vectors in $M$, then $\lambda\mathbf{a} + \mu\mathbf{b}$ is also in $M$ for every pair of scalars $\lambda$, $\mu$. The dimension of a subspace $M$ is equal to the maximum number of linearly independent vectors in $M$. If $M$ is a subspace of $E^n$, the orthogonal complement of $M$, denoted $M^{\perp}$, consists of all vectors that are orthogonal to every vector in $M$. The orthogonal complement of $M$ is easily seen to be a subspace, and together $M$ and $M^{\perp}$ span $E^n$ in the sense that every vector $\mathbf{x} \in E^n$ can be written uniquely in the form $\mathbf{x} = \mathbf{a} + \mathbf{b}$ with $\mathbf{a} \in M$, $\mathbf{b} \in M^{\perp}$. In this case $\mathbf{a}$ and $\mathbf{b}$ are said to be the orthogonal projections of $\mathbf{x}$ onto the subspaces $M$ and $M^{\perp}$, respectively.

A correspondence $\mathbf{A}$ that associates with each point in a space $X$ a point in a space $Y$ is said to be a mapping from $X$ to $Y$. For convenience this situation is symbolized by $\mathbf{A} : X \to Y$. The mapping $\mathbf{A}$ may be either linear or nonlinear. The norm of a linear mapping $\mathbf{A}$ is defined as $|\mathbf{A}| = \max_{|\mathbf{x}| \leq 1}|\mathbf{A}\mathbf{x}|$. It follows that for any $\mathbf{x}$, $|\mathbf{A}\mathbf{x}| \leq |\mathbf{A}| \cdot |\mathbf{x}|$.


A.4  Eigenvalues  and  Quadratic  Forms 

Corresponding to an $n \times n$ square matrix $\mathbf{A}$, a scalar $\lambda$ and a nonzero vector $\mathbf{x}$
satisfying the equation $\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$ are said to be, respectively, an eigenvalue and eigenvector
of $\mathbf{A}$. In order that $\lambda$ be an eigenvalue it is clear that it is necessary and sufficient for
$\mathbf{A} - \lambda\mathbf{I}$ to be singular, and hence $\det(\mathbf{A} - \lambda\mathbf{I}) = 0$. This last result, when expanded,
yields an $n$th-order polynomial equation which can be solved for $n$ (possibly nondistinct)
complex roots $\lambda$ which are the eigenvalues of $\mathbf{A}$.

Now,  for  the  remainder  of  this  section,  assume  that  A  is  symmetric.  Then  the 
following  properties  hold: 

(i)  The  eigenvalues  of  A  are  real. 

(ii)  Eigenvectors  associated  with  distinct  eigenvalues  are  orthogonal. 

(iii)  There  is  an  orthogonal  basis  for  En,  each  element  of  which  is  an  eigenvector 
of  A. 

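If the basis $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n$ in (iii) is normalized so that each element has magnitude
unity, then defining the matrix $\mathbf{Q} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n]$ we note that $\mathbf{Q}^T\mathbf{Q} = \mathbf{I}$ and
hence $\mathbf{Q}^T = \mathbf{Q}^{-1}$. A matrix with this property is said to be an orthogonal matrix.
Also, we observe, in this case, that

$$\mathbf{Q}^{-1}\mathbf{A}\mathbf{Q} = \mathbf{Q}^T\mathbf{A}\mathbf{Q} = \mathbf{Q}^T[\mathbf{A}\mathbf{u}_1, \mathbf{A}\mathbf{u}_2, \ldots, \mathbf{A}\mathbf{u}_n] = \mathbf{Q}^T[\lambda_1\mathbf{u}_1, \lambda_2\mathbf{u}_2, \ldots, \lambda_n\mathbf{u}_n].$$

Thus

$$\mathbf{Q}^{-1}\mathbf{A}\mathbf{Q} = \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{bmatrix},$$

and therefore $\mathbf{A}$ is similar to a diagonal matrix.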

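A symmetric matrix $\mathbf{A}$ is said to be positive definite if the quadratic form $\mathbf{x}^T\mathbf{A}\mathbf{x}$ is
positive for all nonzero vectors $\mathbf{x}$. Similarly, we define $\mathbf{A}$ to be positive semidefinite,
negative definite, or negative semidefinite if $\mathbf{x}^T\mathbf{A}\mathbf{x} \ge 0$, $< 0$, or $\le 0$ for all nonzero $\mathbf{x}$. The
matrix $\mathbf{A}$ is indefinite if $\mathbf{x}^T\mathbf{A}\mathbf{x}$ is positive for some $\mathbf{x}$ and negative for others.

It is easy to obtain a connection between definiteness and the eigenvalues of $\mathbf{A}$.
For any $\mathbf{x}$ let $\mathbf{y} = \mathbf{Q}^{-1}\mathbf{x}$ where $\mathbf{Q}$ is defined as above. Then $\mathbf{x}^T\mathbf{A}\mathbf{x} = \mathbf{y}^T\mathbf{Q}^T\mathbf{A}\mathbf{Q}\mathbf{y} = \sum_{i=1}^{n} \lambda_i y_i^2$.
Since the $y_i$'s are arbitrary (since $\mathbf{x}$ is), it is clear that $\mathbf{A}$ is positive
definite (or positive semidefinite) if and only if all eigenvalues of $\mathbf{A}$ are positive (or
nonnegative).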
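
The eigenvalue test can be tried out directly in the $2 \times 2$ symmetric case, where the eigenvalues have a closed form. The sketch below is only an illustration under that restriction; the function names, the tolerance, and the sample matrices are assumptions made for the example.

```python
import math

def eigenvalues_sym2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first."""
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean - radius, mean + radius

def classify(a, b, c, tol=1e-12):
    """Classify [[a, b], [b, c]] by the signs of its eigenvalues."""
    lo, hi = eigenvalues_sym2(a, b, c)
    if lo > tol:
        return "positive definite"
    if hi < -tol:
        return "negative definite"
    if lo >= -tol:
        return "positive semidefinite"   # the zero matrix also lands here
    if hi <= tol:
        return "negative semidefinite"
    return "indefinite"

print(eigenvalues_sym2(2.0, -1.0, 2.0), classify(2.0, -1.0, 2.0))
print(eigenvalues_sym2(1.0, 1.0, 1.0), classify(1.0, 1.0, 1.0))
print(eigenvalues_sym2(1.0, 2.0, 1.0), classify(1.0, 2.0, 1.0))
```

The three sample matrices have eigenvalues $(1, 3)$, $(0, 2)$, and $(-1, 3)$, so they are positive definite, positive semidefinite, and indefinite, respectively.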

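Through diagonalization we can also easily show that a positive semidefinite
matrix $\mathbf{A}$ has a positive semidefinite (symmetric) square root $\mathbf{A}^{1/2}$ satisfying
$\mathbf{A}^{1/2} \cdot \mathbf{A}^{1/2} = \mathbf{A}$. For this we use $\mathbf{Q}$ as above and define

$$\mathbf{A}^{1/2} = \mathbf{Q}\,\mathrm{diag}\left(\lambda_1^{1/2}, \lambda_2^{1/2}, \ldots, \lambda_n^{1/2}\right)\mathbf{Q}^T,$$

which is easily verified to have the desired properties.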


A.5  Topological  Concepts 


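A sequence of vectors $\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_k, \ldots$, denoted by $\{\mathbf{x}_k\}_{k=0}^{\infty}$, or if the index set
is understood, by simply $\{\mathbf{x}_k\}$, is said to converge to the limit $\mathbf{x}$ if $|\mathbf{x}_k - \mathbf{x}| \to 0$ as
$k \to \infty$ (that is, if given $\varepsilon > 0$, there is an $N$ such that $k \ge N$ implies $|\mathbf{x}_k - \mathbf{x}| < \varepsilon$). If
$\{\mathbf{x}_k\}$ converges to $\mathbf{x}$, we write $\mathbf{x}_k \to \mathbf{x}$ or $\lim \mathbf{x}_k = \mathbf{x}$.

A point $\mathbf{x}$ is a limit point of the sequence $\{\mathbf{x}_k\}$ if there is a subsequence of $\{\mathbf{x}_k\}$
convergent to $\mathbf{x}$. Thus $\mathbf{x}$ is a limit point of $\{\mathbf{x}_k\}$ if there is a subset $\mathcal{K}$ of the positive
integers such that $\{\mathbf{x}_k\}_{k \in \mathcal{K}}$ converges to $\mathbf{x}$.

A ball (sphere) around $\mathbf{x}$ is a set of the form $\{\mathbf{y} : |\mathbf{y} - \mathbf{x}| < \varepsilon\}$ (with $=$ in place of $<$ for the sphere) for some $\varepsilon > 0$.
Such a ball is also referred to as the neighborhood of $\mathbf{x}$ of radius $\varepsilon$.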

A subset $S$ of $E^n$ is open if around every point in $S$ there is a sphere that is
contained in $S$. Equivalently, $S$ is open if given $\mathbf{x} \in S$ there is an $\varepsilon > 0$ such that
$|\mathbf{y} - \mathbf{x}| < \varepsilon$ implies $\mathbf{y} \in S$. Thus the sphere $\{\mathbf{x} : |\mathbf{x}| < 1\}$ is open. In general, open sets
can be characterized as sets having no sharp boundaries. The interior of any set $S$
in $E^n$ is the set of points $\mathbf{x} \in S$ which are the center of some sphere contained in $S$.
It is denoted $\mathring{S}$. The interior of a set is always open; indeed it is the largest open set
contained in $S$. The interior of the set $\{\mathbf{x} : |\mathbf{x}| \le 1\}$ is the sphere $\{\mathbf{x} : |\mathbf{x}| < 1\}$.

A set $P$ is closed if every point that is arbitrarily close to the set $P$ is a member
of $P$. Equivalently, $P$ is closed if $\mathbf{x}_k \to \mathbf{x}$ with $\mathbf{x}_k \in P$ implies $\mathbf{x} \in P$. Thus the set
$\{\mathbf{x} : |\mathbf{x}| \le 1\}$ is closed. The closure of any set $P$ in $E^n$ is the smallest closed set
containing $P$. It is denoted $\overline{P}$. The boundary of a set is that part of the closure that is
not in the interior.

A set is compact if it is both closed and bounded (that is, if it is closed and is
contained within some sphere of finite radius). An important result, due to Weierstrass,
is that if $S$ is a compact set and $\{\mathbf{x}_k\}$ is a sequence each member of which belongs
to $S$, then $\{\mathbf{x}_k\}$ has a limit point in $S$ (that is, there is a subsequence converging to a
point in $S$).

Corresponding to a bounded sequence $\{r_k\}_{k=0}^{\infty}$ of real numbers, if we let $s_k =
\sup\{r_i : i \ge k\}$ then $\{s_k\}$ converges to some real number $s_0$. This number is called the
limit superior of $\{r_k\}$ and is denoted $\overline{\lim}(r_k)$.


A.6  Functions 


A real-valued function $f$ defined on a subset of $E^n$ is said to be continuous at $\mathbf{x}$ if
$\mathbf{x}_k \to \mathbf{x}$ implies $f(\mathbf{x}_k) \to f(\mathbf{x})$. Equivalently, $f$ is continuous at $\mathbf{x}$ if given $\varepsilon > 0$ there
is a $\delta > 0$ such that $|\mathbf{y} - \mathbf{x}| < \delta$ implies $|f(\mathbf{y}) - f(\mathbf{x})| < \varepsilon$. An important result connected
with continuous functions is a theorem of Weierstrass: A continuous function $f$
defined on a compact set $S$ has a minimum point in $S$; that is, there is an $\mathbf{x}^* \in S$
such that for all $\mathbf{x} \in S$, $f(\mathbf{x}) \ge f(\mathbf{x}^*)$.

A set of real-valued functions $f_1, f_2, \ldots, f_m$ on $E^n$ can be regarded as a single
vector function $\mathbf{f} = (f_1, f_2, \ldots, f_m)$. This function assigns a vector $\mathbf{f}(\mathbf{x}) =
(f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_m(\mathbf{x}))$ in $E^m$ to every vector $\mathbf{x} \in E^n$. Such a vector-valued
function is said to be continuous if each of its component functions is continuous.

If each component of $\mathbf{f} = (f_1, f_2, \ldots, f_m)$ is continuous on some open set of
$E^n$, then we write $\mathbf{f} \in C$. If in addition, each component function has first partial
derivatives which are continuous on this set, we write $\mathbf{f} \in C^1$. In general, if the
component functions have continuous partial derivatives of order $p$, we write $\mathbf{f} \in C^p$.

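If $f \in C^1$ is a real-valued function on $E^n$, $f(\mathbf{x}) = f(x_1, x_2, \ldots, x_n)$, we define
the gradient of $f$ to be the vector

$$\nabla f(\mathbf{x}) = \left[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_n}\right].$$

We sometimes use the alternative notation $f_{\mathbf{x}}(\mathbf{x})$ for $\nabla f(\mathbf{x})$. In matrix calculations
the gradient is considered to be a row vector.

If $f \in C^2$ then we define the Hessian of $f$ at $\mathbf{x}$ to be the $n \times n$ matrix denoted
$\nabla^2 f(\mathbf{x})$ or $\mathbf{F}(\mathbf{x})$ as

$$\mathbf{F}(\mathbf{x}) = \left[\frac{\partial^2 f(\mathbf{x})}{\partial x_i\,\partial x_j}\right].$$

Since

$$\frac{\partial^2 f}{\partial x_i\,\partial x_j} = \frac{\partial^2 f}{\partial x_j\,\partial x_i},$$

it is easily seen that the Hessian is symmetric.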
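
The definitions of the gradient and the Hessian can be checked numerically with central differences. The following is a rough sketch under stated assumptions: the step sizes, the helper names, and the test function $f(\mathbf{x}) = x_1^2 x_2 + 3x_2^2$ are chosen only for illustration and are not part of the text.

```python
def gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2.0 * h))
    return g

def hessian(f, x, h=1e-4):
    """Central-difference approximation of the Hessian of f at x."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            xpp, xpm, xmp, xmm = list(x), list(x), list(x), list(x)
            xpp[i] += h; xpp[j] += h
            xpm[i] += h; xpm[j] -= h
            xmp[i] -= h; xmp[j] += h
            xmm[i] -= h; xmm[j] -= h
            H[i][j] = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4.0 * h * h)
    return H

def f(x):
    # Hypothetical test function: gradient [2*x1*x2, x1**2 + 6*x2],
    # Hessian [[2*x2, 2*x1], [2*x1, 6]].
    return x[0] ** 2 * x[1] + 3.0 * x[1] ** 2

point = [1.0, 2.0]
g = gradient(f, point)   # approximately [4, 13]
F = hessian(f, point)    # approximately [[4, 2], [2, 6]]
print(g)
print(F)
```

The computed Hessian is symmetric up to rounding error, in agreement with the equality of the mixed partial derivatives.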

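For a vector-valued function $\mathbf{f} = (f_1, f_2, \ldots, f_m)$ the situation is similar. If
$\mathbf{f} \in C^1$, the first derivative is defined as the $m \times n$ matrix

$$\nabla\mathbf{f}(\mathbf{x}) = \left[\frac{\partial f_i(\mathbf{x})}{\partial x_j}\right].$$

If $\mathbf{f} \in C^2$ it is possible to define the $m$ Hessians $\mathbf{F}_1(\mathbf{x}), \mathbf{F}_2(\mathbf{x}), \ldots, \mathbf{F}_m(\mathbf{x})$
corresponding to the $m$ component functions. The second derivative itself, for a vector
function, is a third-order tensor but we do not require its use explicitly. Given any
$\boldsymbol{\lambda}^T = [\lambda_1, \lambda_2, \ldots, \lambda_m] \in E^m$, we note, however, that the real-valued function $\boldsymbol{\lambda}^T\mathbf{f}$
has gradient equal to $\boldsymbol{\lambda}^T\nabla\mathbf{f}(\mathbf{x})$ and Hessian, denoted $\boldsymbol{\lambda}^T\mathbf{F}(\mathbf{x})$, equal to

$$\boldsymbol{\lambda}^T\mathbf{F}(\mathbf{x}) = \sum_{i=1}^{m} \lambda_i \mathbf{F}_i(\mathbf{x}).$$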

Also  see  Sect.  7.4  for  a  discussion  of  convex  functions. 


Taylor’s  Theorem 

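A group of results that are used frequently in analysis are referred to under the
general heading of Taylor's Theorem or Mean Value Theorems. If $f \in C^1$ in a
region containing the line segment $[\mathbf{x}_1, \mathbf{x}_2]$, then there is a $\theta$, $0 \le \theta \le 1$, such that

$$f(\mathbf{x}_2) = f(\mathbf{x}_1) + \nabla f(\theta\mathbf{x}_1 + (1 - \theta)\mathbf{x}_2)(\mathbf{x}_2 - \mathbf{x}_1).$$

Furthermore, if $f \in C^2$ then there is a $\theta$, $0 \le \theta \le 1$, such that

$$f(\mathbf{x}_2) = f(\mathbf{x}_1) + \nabla f(\mathbf{x}_1)(\mathbf{x}_2 - \mathbf{x}_1) + \tfrac{1}{2}(\mathbf{x}_2 - \mathbf{x}_1)^T\mathbf{F}(\theta\mathbf{x}_1 + (1 - \theta)\mathbf{x}_2)(\mathbf{x}_2 - \mathbf{x}_1),$$

where $\mathbf{F}$ denotes the Hessian of $f$.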
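
For a quadratic function the Hessian is constant, so the second-order formula holds exactly for every $\theta$. The short check below illustrates this; the matrix, the vectors, and the helper names are assumptions chosen for the example.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

# Hypothetical quadratic f(x) = 0.5 * x^T Q x + c^T x with symmetric Q,
# so that its gradient is Q x + c and its Hessian is the constant matrix Q.
Q = [[2.0, 1.0], [1.0, 4.0]]
c = [1.0, -3.0]

def f(x):
    return 0.5 * dot(x, matvec(Q, x)) + dot(c, x)

def grad(x):
    return [qi + ci for qi, ci in zip(matvec(Q, x), c)]

x1 = [1.0, 2.0]
x2 = [3.0, -1.0]
d = [b - a for a, b in zip(x1, x2)]          # x2 - x1

lhs = f(x2)
rhs = f(x1) + dot(grad(x1), d) + 0.5 * dot(d, matvec(Q, d))
print(lhs, rhs)   # the two values agree
```

For a general function in $C^2$ the Hessian in the last term must be evaluated at an intermediate point of the segment, which the theorem guarantees to exist but does not identify.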


Implicit Function Theorem

Suppose we have a set of $m$ equations in $n$ variables

$$h_i(\mathbf{x}) = 0, \quad i = 1, 2, \ldots, m.$$

The implicit function theorem addresses the question as to whether if $n - m$ of the
variables are fixed, the equations can be solved for the remaining $m$ variables. Thus
selecting $m$ variables, say $x_1, x_2, \ldots, x_m$, we wish to determine if these may be
expressed in terms of the remaining variables in the form

$$x_i = \phi_i(x_{m+1}, x_{m+2}, \ldots, x_n), \quad i = 1, 2, \ldots, m.$$

The functions $\phi_i$, if they exist, are called implicit functions.

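Theorem. Let $\mathbf{x}^0 = (x_1^0, x_2^0, \ldots, x_n^0)$ be a point in $E^n$ satisfying the properties:

i) The functions $h_i \in C^p$, $i = 1, 2, \ldots, m$ in some neighborhood of $\mathbf{x}^0$, for some $p \ge 1$.

ii) $h_i(\mathbf{x}^0) = 0$, $i = 1, 2, \ldots, m$.

iii) The $m \times m$ Jacobian matrix

$$\mathbf{J} = \begin{bmatrix} \dfrac{\partial h_1(\mathbf{x}^0)}{\partial x_1} & \cdots & \dfrac{\partial h_1(\mathbf{x}^0)}{\partial x_m} \\ \vdots & & \vdots \\ \dfrac{\partial h_m(\mathbf{x}^0)}{\partial x_1} & \cdots & \dfrac{\partial h_m(\mathbf{x}^0)}{\partial x_m} \end{bmatrix}$$

is nonsingular.

Then there is a neighborhood of $\hat{\mathbf{x}}^0 = (x_{m+1}^0, x_{m+2}^0, \ldots, x_n^0) \in E^{n-m}$ such that for $\hat{\mathbf{x}} =
(x_{m+1}, x_{m+2}, \ldots, x_n)$ in this neighborhood there are functions $\phi_i(\hat{\mathbf{x}})$, $i = 1, 2, \ldots, m$ such
that

i) $\phi_i \in C^p$.

ii) $x_i^0 = \phi_i(\hat{\mathbf{x}}^0)$, $i = 1, 2, \ldots, m$.

iii) $h_i(\phi_1(\hat{\mathbf{x}}), \phi_2(\hat{\mathbf{x}}), \ldots, \phi_m(\hat{\mathbf{x}}), \hat{\mathbf{x}}) = 0$, $i = 1, 2, \ldots, m$.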

Example 1. Consider the equation $x_1^2 + x_2 = 0$. A solution is $x_1 = 0$, $x_2 = 0$.
However, in a neighborhood of this solution there is no function $\phi$ such that $x_1 =
\phi(x_2)$. At this solution condition (iii) of the implicit function theorem is violated,
since $\partial h/\partial x_1 = 2x_1 = 0$ there. At any other solution, however, such a $\phi$ exists.

Example 2. Let $\mathbf{A}$ be an $m \times n$ matrix ($m < n$) and consider the system of linear
equations $\mathbf{A}\mathbf{x} = \mathbf{b}$. If $\mathbf{A}$ is partitioned as $\mathbf{A} = [\mathbf{B}, \mathbf{C}]$ where $\mathbf{B}$ is $m \times m$ then condition
(iii) is satisfied if and only if $\mathbf{B}$ is nonsingular. This condition corresponds, of course,
exactly with what the theory of linear equations tells us. In view of this example, the
implicit function theorem can be regarded as a nonlinear generalization of the linear theory.
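
In the linear case of Example 2 the implicit functions can be written down explicitly: when $\mathbf{B}$ is nonsingular, the first $m$ variables are given by $\mathbf{B}^{-1}(\mathbf{b} - \mathbf{C}\hat{\mathbf{x}})$. The sketch below works this out for $m = 2$, $n = 3$ using the closed-form solution of a $2 \times 2$ system. The data and the names `solve_2x2` and `phi` are hypothetical.

```python
def solve_2x2(B, r):
    """Solve B z = r for a nonsingular 2 x 2 matrix B (Cramer's rule)."""
    det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
    if det == 0:
        raise ValueError("B is singular: condition (iii) fails")
    z1 = (r[0] * B[1][1] - B[0][1] * r[1]) / det
    z2 = (B[0][0] * r[1] - B[1][0] * r[0]) / det
    return [z1, z2]

# Hypothetical data: A = [B, C] with B 2 x 2 and C 2 x 1, so n - m = 1.
B = [[2.0, 1.0], [1.0, 3.0]]
C = [[1.0], [2.0]]
b = [5.0, 10.0]

def phi(t):
    """Implicit functions (x1, x2) = phi(x3) defined by A x = b."""
    r = [b[i] - C[i][0] * t for i in range(2)]
    return solve_2x2(B, r)

x1, x2 = phi(1.0)
print(x1, x2)   # approximately 0.8 and 2.4
```

Substituting $x_1$, $x_2$, and $x_3 = 1$ back into the two equations recovers the right-hand side $\mathbf{b}$, up to rounding.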

If $g$ is a real-valued function of a real variable, the notation $g(x) = O(x)$ means that
$g(x)$ goes to zero at least as fast as $x$ does. More precisely, it means that there is a
$K \ge 0$ such that

$$\left|\frac{g(x)}{x}\right| \le K \quad \text{as } x \to 0.$$

The notation $g(x) = o(x)$ means that $g(x)$ goes to zero faster than $x$ does; or
equivalently, that $K$ above is zero.


Appendix  B 

Convex  Sets 


B.1 Basic Definitions

Concepts  related  to  convex  sets  so  dominate  the  theory  of  optimization  that  it  is 
essential  for  a  student  of  optimization  to  have  knowledge  of  their  most  fundamental 
properties.  In  this  appendix  is  compiled  a  brief  summary  of  the  most  important  of 
these  properties. 

Definition. A set $C$ in $E^n$ is said to be convex if for every $\mathbf{x}_1, \mathbf{x}_2 \in C$ and every real number
$\alpha$, $0 < \alpha < 1$, the point $\alpha\mathbf{x}_1 + (1 - \alpha)\mathbf{x}_2 \in C$.

This definition can be interpreted geometrically as stating that a set is convex if,
given two points in the set, every point on the line segment joining these two points
is also a member of the set. This is illustrated in Fig. B.1.

The  following  proposition  shows  that  certain  familiar  set  operations  preserve 
convexity. 

Proposition 1. Convex sets in $E^n$ satisfy the following relations:

i) If $C$ is a convex set and $\beta$ is a real number, the set

$$\beta C = \{\mathbf{x} : \mathbf{x} = \beta\mathbf{c},\ \mathbf{c} \in C\}$$

is convex.

ii) If $C$ and $D$ are convex sets, then the set

$$C + D = \{\mathbf{x} : \mathbf{x} = \mathbf{c} + \mathbf{d},\ \mathbf{c} \in C,\ \mathbf{d} \in D\}$$

is convex.

iii) The intersection of any collection of convex sets is convex.

The proofs of these three properties follow directly from the definition of a convex
set and are left to the reader. The properties themselves are illustrated in Fig. B.2.

Another important concept is that of forming the smallest convex set containing
a given set.


Fig. B.1 Convexity


Fig. B.2 Properties of convex sets


Definition. Let $S$ be a subset of $E^n$. The convex hull of $S$, denoted $\mathrm{co}(S)$, is the set which
is the intersection of all convex sets containing $S$. The closed convex hull of $S$ is defined as
the closure of $\mathrm{co}(S)$.

Finally,  we  conclude  this  section  by  defining  a  cone  and  a  convex  cone.  A  convex 
cone  is  a  special  kind  of  convex  set  that  arises  quite  frequently. 

Definition. A set $C$ is a cone if $\mathbf{x} \in C$ implies $\alpha\mathbf{x} \in C$ for all $\alpha > 0$. A cone that is also
convex is a convex cone.

Some cones are shown in Fig. B.3. Their basic property is that if a point $\mathbf{x}$ belongs
to a cone, then the entire half line from the origin through the point (but not the
origin itself) also must belong to the cone.

Fig. B.3 Cones

B.2  Hyperplanes  and  Polytopes 

The  most  important  type  of  convex  set  (aside  from  single  points)  is  the  hyperplane. 
Hyperplanes  dominate  the  entire  theory  of  optimization,  appearing  under  the  guise 
of  Lagrange  multipliers,  duality  theory,  or  gradient  calculations. 

The  most  natural  definition  of  a  hyperplane  is  the  logical  generalization  of  the 
geometric  properties  of  a  plane  in  three  dimensions.  We  start  by  giving  this  geo¬ 
metric  definition.  For  computations  and  for  a  concrete  description  of  hyperplanes, 
however,  there  is  an  equivalent  algebraic  definition  that  is  more  useful.  A  major 
portion  of  this  section  is  devoted  to  establishing  this  equivalence. 

Definition. A set $V$ in $E^n$ is said to be a linear variety, if, given any $\mathbf{x}_1, \mathbf{x}_2 \in V$, we have
$\lambda\mathbf{x}_1 + (1 - \lambda)\mathbf{x}_2 \in V$ for all real numbers $\lambda$.

Note  that  the  only  difference  between  the  definition  of  a  linear  variety  and  a 
convex  set  is  that  in  a  linear  variety  the  entire  line  passing  through  any  two  points, 
rather  than  simply  the  line  segment  between  them,  must  lie  in  the  set.  Thus  in  three 
dimensions  the  nonempty  linear  varieties  are  points,  lines,  two-dimensional  planes, 
and  the  whole  space.  In  general,  it  is  clear  that  we  may  speak  of  the  dimension  of  a 
linear  variety.  Thus,  for  example,  a  point  is  a  linear  variety  of  dimension  zero  and 
a  line  is  a  linear  variety  of  dimension  one.  In  the  general  case,  the  dimension  of 
a  linear  variety  in  En  can  be  found  by  translating  it  (moving  it)  so  that  it  contains 
the  origin  and  then  determining  the  dimension  of  the  resulting  set,  which  is  then  a 
subspace  of  En . 

Definition. A hyperplane in $E^n$ is an $(n - 1)$-dimensional linear variety.

We  see  that  hyperplanes  generalize  the  concept  of  a  two-dimensional  plane  in 
three-dimensional  space.  They  can  be  regarded  as  the  largest  linear  varieties  in  a 
space,  other  than  the  entire  space  itself. 

We  now  relate  this  abstract  geometric  definition  to  an  algebraic  one. 

Proposition 2. Let $\mathbf{a}$ be a nonzero $n$-dimensional column vector, and let $c$ be a real number.
The set

$$H = \{\mathbf{x} \in E^n : \mathbf{a}^T\mathbf{x} = c\}$$

is a hyperplane in $E^n$.

Proof. It follows directly from the linearity of the equation $\mathbf{a}^T\mathbf{x} = c$ that $H$ is a linear
variety. Let $\mathbf{x}_1$ be any vector in $H$. Translating by $-\mathbf{x}_1$ we obtain the set $M = H - \mathbf{x}_1$
which is a linear subspace of $E^n$. This subspace consists of all vectors $\mathbf{x}$ satisfying
$\mathbf{a}^T\mathbf{x} = 0$; in other words, all vectors orthogonal to $\mathbf{a}$. This is clearly an $(n - 1)$-dimensional
subspace. ∎

Proposition 3. Let $H$ be a hyperplane in $E^n$. Then there is a nonzero $n$-dimensional vector $\mathbf{a}$
and a constant $c$ such that

$$H = \{\mathbf{x} \in E^n : \mathbf{a}^T\mathbf{x} = c\}.$$

Proof. Let $\mathbf{x}_1 \in H$ and translate by $-\mathbf{x}_1$ obtaining the set $M = H - \mathbf{x}_1$. Since $H$ is a
hyperplane, $M$ is an $(n - 1)$-dimensional subspace. Let $\mathbf{a}$ be any nonzero vector that is
orthogonal to this subspace, that is, $\mathbf{a}$ belongs to the one-dimensional subspace $M^{\perp}$.
Clearly $M = \{\mathbf{x} : \mathbf{a}^T\mathbf{x} = 0\}$. Letting $c = \mathbf{a}^T\mathbf{x}_1$ we see that if $\mathbf{x}_2 \in H$ we have $\mathbf{x}_2 - \mathbf{x}_1 \in
M$ and thus $\mathbf{a}^T\mathbf{x}_2 - \mathbf{a}^T\mathbf{x}_1 = 0$ which implies $\mathbf{a}^T\mathbf{x}_2 = c$. Thus $H \subset \{\mathbf{x} : \mathbf{a}^T\mathbf{x} = c\}$. Since
$H$ is, by definition, of dimension $n - 1$ and $\{\mathbf{x} : \mathbf{a}^T\mathbf{x} = c\}$ is of dimension $n - 1$ by
Proposition 2, these two sets must be equal. ∎

Combining  Propositions  2  and  3,  we  see  that  a  hyperplane  is  the  set  of  solutions 
to  a  single  linear  equation.  This  is  illustrated  in  Fig.  B.4.  We  now  use  hyperplanes 
to  build  up  other  important  classes  of  convex  sets. 

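Definition. Let $\mathbf{a}$ be a nonzero vector in $E^n$ and let $c$ be a real number. Corresponding to
the hyperplane $H = \{\mathbf{x} : \mathbf{a}^T\mathbf{x} = c\}$ are the positive and negative closed half spaces

$$H_+ = \{\mathbf{x} : \mathbf{a}^T\mathbf{x} \ge c\}$$

$$H_- = \{\mathbf{x} : \mathbf{a}^T\mathbf{x} \le c\}$$

and the positive and negative open half spaces

$$\mathring{H}_+ = \{\mathbf{x} : \mathbf{a}^T\mathbf{x} > c\}$$

$$\mathring{H}_- = \{\mathbf{x} : \mathbf{a}^T\mathbf{x} < c\}.$$

It is easy to see that half spaces are convex sets and that the union of $H_+$ and $H_-$
is the whole space.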

Definition. A set which can be expressed as the intersection of a finite number of closed
half spaces is said to be a convex polytope.


Fig. B.4 Hyperplane


Fig. B.5 Polytopes


We see that convex polytopes are the sets obtained as the family of solutions to a
set of linear inequalities of the form

$$\begin{aligned} \mathbf{a}_1^T\mathbf{x} &\le b_1 \\ \mathbf{a}_2^T\mathbf{x} &\le b_2 \\ &\ \,\vdots \\ \mathbf{a}_m^T\mathbf{x} &\le b_m \end{aligned}$$

since each individual inequality defines a half space and the solution family is the
intersection of these half spaces. (If some $\mathbf{a}_i = \mathbf{0}$, the resulting set can still, as the
reader may verify, be expressed as the intersection of a finite number of half spaces.)
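
The algebraic description makes membership in a polytope easy to test: a point belongs to the set exactly when it satisfies every inequality. The following sketch assumes the unit square in $E^2$ as an example; the function name, the tolerance, and the data are illustrative only.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def in_polytope(A, b, x, tol=1e-12):
    """True if a_i^T x <= b_i for every row a_i of A (within a small tolerance)."""
    return all(dot(a_i, x) <= b_i + tol for a_i, b_i in zip(A, b))

# Hypothetical example: the unit square 0 <= x1 <= 1, 0 <= x2 <= 1,
# written as four closed half spaces a_i^T x <= b_i.
A = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b = [1.0, 0.0, 1.0, 0.0]

p = [0.25, 0.5]
q = [1.0, 1.0]
midpoint = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]

print(in_polytope(A, b, p))           # True
print(in_polytope(A, b, midpoint))    # True, as convexity requires
print(in_polytope(A, b, [1.5, 0.5]))  # False
```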

Several  polytopes  are  illustrated  in  Fig.  B.5.  We  note  that  a  polytope  may  be 
empty,  bounded,  or  unbounded.  The  case  of  a  nonempty  bounded  polytope  is  of 
special  interest  and  we  distinguish  this  case  by  the  following. 

Definition.  A  nonempty  bounded  polytope  is  called  a  polyhedron. 


B.3  Separating  and  Supporting  Hyperplanes 

The  two  theorems  in  this  section  are  perhaps  the  most  important  results  related  to 
convexity.  Geometrically,  the  first  states  that  given  a  point  outside  a  convex  set,  a 
hyperplane  can  be  passed  through  the  point  that  does  not  touch  the  convex  set.  The 
second,  which  is  a  limiting  case  of  the  first,  states  that  given  a  boundary  point  of  a 
convex  set,  there  is  a  hyperplane  that  contains  the  boundary  point  and  contains  the 
convex  set  on  one  side  of  it. 

Theorem 1. Let $C$ be a convex set and let $\mathbf{y}$ be a point exterior to the closure of $C$. Then
there is a vector $\mathbf{a}$ such that $\mathbf{a}^T\mathbf{y} < \inf_{\mathbf{x} \in C} \mathbf{a}^T\mathbf{x}$.

Proof. Let

$$\delta = \inf_{\mathbf{x} \in C} |\mathbf{x} - \mathbf{y}| > 0.$$

There is an $\mathbf{x}_0$ on the boundary of $C$ such that $|\mathbf{x}_0 - \mathbf{y}| = \delta$. This follows because
the continuous function $f(\mathbf{x}) = |\mathbf{x} - \mathbf{y}|$ achieves its minimum over any closed and
bounded set and it is clearly only necessary to consider $\mathbf{x}$ in the intersection of the
closure of $C$ and the sphere of radius $2\delta$ centered at $\mathbf{y}$.

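We shall show that setting $\mathbf{a} = \mathbf{x}_0 - \mathbf{y}$ satisfies the conditions of the theorem. Let
$\mathbf{x} \in C$. For any $\alpha$, $0 \le \alpha \le 1$, the point $\mathbf{x}_0 + \alpha(\mathbf{x} - \mathbf{x}_0) \in \overline{C}$ and thus

$$|\mathbf{x}_0 + \alpha(\mathbf{x} - \mathbf{x}_0) - \mathbf{y}|^2 \ge |\mathbf{x}_0 - \mathbf{y}|^2.$$

Expanding,

$$2\alpha(\mathbf{x}_0 - \mathbf{y})^T(\mathbf{x} - \mathbf{x}_0) + \alpha^2|\mathbf{x} - \mathbf{x}_0|^2 \ge 0.$$

Thus, considering this as $\alpha \to 0+$, we obtain

$$(\mathbf{x}_0 - \mathbf{y})^T(\mathbf{x} - \mathbf{x}_0) \ge 0$$

or,

$$(\mathbf{x}_0 - \mathbf{y})^T\mathbf{x} \ge (\mathbf{x}_0 - \mathbf{y})^T\mathbf{x}_0 = (\mathbf{x}_0 - \mathbf{y})^T\mathbf{y} + (\mathbf{x}_0 - \mathbf{y})^T(\mathbf{x}_0 - \mathbf{y}) = (\mathbf{x}_0 - \mathbf{y})^T\mathbf{y} + \delta^2.$$

Setting $\mathbf{a} = \mathbf{x}_0 - \mathbf{y}$ proves the theorem. ∎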

The  geometrical  interpretation  of  Theorem  1  is  that,  given  a  convex  set  C  and  a 
point  y  exterior  to  the  closure  of  C,  there  is  a  hyperplane  containing  y  that  contains 
C  in  one  of  its  open  half  spaces.  We  can  easily  extend  this  theorem  to  include  the 
case  where  y  is  a  boundary  point  of  C. 
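
The construction in the proof of Theorem 1 can be carried out by hand when $C$ is a box, since the closest point of a box to $\mathbf{y}$ is obtained by clamping each coordinate. The sketch below is an illustration under that assumption; the box, the point $\mathbf{y}$, and the helper names are chosen for the example and do not come from the text.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def nearest_point_in_box(y, lower, upper):
    """Point of the box {x : lower <= x <= upper} closest to y (componentwise clamp)."""
    return [min(max(yi, lo), hi) for yi, lo, hi in zip(y, lower, upper)]

def min_over_box(a, lower, upper):
    """Minimum of the linear function a^T x over the box."""
    return sum(min(ai * lo, ai * hi) for ai, lo, hi in zip(a, lower, upper))

# Hypothetical data: C is the unit square and y lies outside it.
lower, upper = [0.0, 0.0], [1.0, 1.0]
y = [2.0, 3.0]

x0 = nearest_point_in_box(y, lower, upper)      # the point x0 of the proof
a = [x0i - yi for x0i, yi in zip(x0, y)]        # a = x0 - y
delta_sq = dot(a, a)                            # delta squared

lhs = dot(a, y)                                 # a^T y
inf_rhs = min_over_box(a, lower, upper)         # inf of a^T x over C
print(x0, a, lhs, inf_rhs, delta_sq)
```

Here $\mathbf{a}^T\mathbf{y} = -8$ while the infimum of $\mathbf{a}^T\mathbf{x}$ over the square is $-3 = -8 + \delta^2$, attained at $\mathbf{x}_0$, in agreement with the final inequality of the proof.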

Theorem 2. Let $C$ be a convex set and let $\mathbf{y}$ be a boundary point of $C$. Then there is a
hyperplane containing $\mathbf{y}$ and containing $C$ in one of its closed half spaces.

Proof. Let $\{\mathbf{y}_k\}$ be a sequence of vectors, exterior to the closure of $C$, converging
to $\mathbf{y}$. Let $\{\mathbf{a}_k\}$ be the sequence of corresponding vectors constructed according to
Theorem 1, normalized so that $|\mathbf{a}_k| = 1$, such that

$$\mathbf{a}_k^T\mathbf{y}_k < \inf_{\mathbf{x} \in C} \mathbf{a}_k^T\mathbf{x}.$$

Since $\{\mathbf{a}_k\}$ is a bounded sequence, it has a convergent subsequence $\{\mathbf{a}_k\}$, $k \in \mathcal{K}$, with
limit $\mathbf{a}$. For this vector we have for any $\mathbf{x} \in C$,

$$\mathbf{a}^T\mathbf{y} = \lim_{k \in \mathcal{K}} \mathbf{a}_k^T\mathbf{y}_k \le \lim_{k \in \mathcal{K}} \mathbf{a}_k^T\mathbf{x} = \mathbf{a}^T\mathbf{x}.$$ ∎

Definition. A hyperplane containing a convex set $C$ in one of its closed half spaces and
containing a boundary point of $C$ is said to be a supporting hyperplane of $C$.

In  terms  of  this  definition,  Theorem  2  says  that,  given  a  convex  set  C  and  a 
boundary  point  y  of  C,  there  is  a  hyperplane  supporting  C  at  y. 

It  is  useful  in  the  study  of  convex  sets  to  consider  the  relative  interior  of  a  convex 
set  C  defined  as  the  largest  subset  of  C  that  contains  no  boundary  points  of  C. 

Another  variation  of  the  theorems  of  this  section  is  the  one  that  follows,  which  is 
commonly  known  as  the  Separating  Hyperplane  Theorem. 

Theorem 3. Let $B$ and $C$ be convex sets with no common relative interior points. (That is,
the only common points are boundary points.) Then there is a hyperplane separating $B$ and
$C$. In particular, there is a nonzero vector $\mathbf{a}$ such that $\sup_{\mathbf{b} \in B} \mathbf{a}^T\mathbf{b} \le \inf_{\mathbf{c} \in C} \mathbf{a}^T\mathbf{c}$.

Proof. Consider the set $G = C - B$. It is easily shown that $G$ is convex and that $\mathbf{0}$ is
not a relative interior point of $G$. Hence, Theorem 1 or Theorem 2 applies and gives
the appropriate hyperplane. ∎


B.4  Extreme  Points 


Definition. A point $\mathbf{x}$ in a convex set $C$ is said to be an extreme point of $C$ if there are no
two distinct points $\mathbf{x}_1$ and $\mathbf{x}_2$ in $C$ such that $\mathbf{x} = \alpha\mathbf{x}_1 + (1 - \alpha)\mathbf{x}_2$ for some $\alpha$, $0 < \alpha < 1$.

For  example,  in  E2  the  extreme  points  of  a  square  are  its  four  corners;  the  extreme 
points  of  a  circular  disk  are  all  points  on  the  boundary.  Note  that  a  linear  variety 
consisting  of  more  than  one  point  has  no  extreme  points. 

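Lemma 1. Let $C$ be a convex set, $H$ a supporting hyperplane of $C$, and $T$ the intersection
of $H$ and $C$. Every extreme point of $T$ is an extreme point of $C$.

Proof. Suppose $\mathbf{x}_0 \in T$ is not an extreme point of $C$. Then $\mathbf{x}_0 = \alpha\mathbf{x}_1 + (1 - \alpha)\mathbf{x}_2$ for
some $\mathbf{x}_1, \mathbf{x}_2 \in C$, $\mathbf{x}_1 \ne \mathbf{x}_2$, $0 < \alpha < 1$. Let $H$ be described as $H = \{\mathbf{x} : \mathbf{a}^T\mathbf{x} = c\}$ with
$C$ contained in its closed positive half space. Then

$$\mathbf{a}^T\mathbf{x}_1 \ge c, \quad \mathbf{a}^T\mathbf{x}_2 \ge c.$$

But, since $\mathbf{x}_0 \in H$,

$$c = \mathbf{a}^T\mathbf{x}_0 = \alpha\mathbf{a}^T\mathbf{x}_1 + (1 - \alpha)\mathbf{a}^T\mathbf{x}_2,$$

and thus $\mathbf{x}_1$ and $\mathbf{x}_2 \in H$. Hence $\mathbf{x}_1, \mathbf{x}_2 \in T$ and $\mathbf{x}_0$ is not an extreme point of $T$. ∎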

Theorem 4. A closed bounded convex set in $E^n$ is equal to the closed convex hull of its
extreme points.

Proof. The proof is by induction on the dimension of the space $E^n$. The statement
is easily seen to be true for $n = 1$. Suppose that it is true for $n - 1$. Let $C$ be a closed
bounded convex set in $E^n$, and let $K$ be the closed convex hull of the extreme points
of $C$. We wish to show that $K = C$.

Assume there is $\mathbf{y} \in C$, $\mathbf{y} \notin K$. Then by Theorem 1, Sect. B.3, there is a hyperplane
separating $\mathbf{y}$ and $K$; that is, there is $\mathbf{a} \ne \mathbf{0}$, such that $\mathbf{a}^T\mathbf{y} < \inf_{\mathbf{x} \in K} \mathbf{a}^T\mathbf{x}$. Let $c_0 =
\inf_{\mathbf{x} \in C}(\mathbf{a}^T\mathbf{x})$. The number $c_0$ is finite and there is an $\mathbf{x}_0 \in C$ for which $\mathbf{a}^T\mathbf{x}_0 = c_0$, because
by Weierstrass' Theorem, the continuous function $\mathbf{a}^T\mathbf{x}$ achieves its minimum over
any closed bounded set. Thus the hyperplane $H = \{\mathbf{x} : \mathbf{a}^T\mathbf{x} = c_0\}$ is a supporting
hyperplane to $C$. It is disjoint from $K$ since $c_0 < \inf_{\mathbf{x} \in K}(\mathbf{a}^T\mathbf{x})$.

Let $T = H \cap C$. Then $T$ is a bounded closed convex subset of $H$ which can be
regarded as a space of dimension $n - 1$. $T$ is nonempty, since it contains $\mathbf{x}_0$. Thus,
by the induction hypothesis, $T$ contains extreme points; and by Lemma 1 these are
also extreme points of $C$. Thus we have found extreme points of $C$ not in $K$, which
is a contradiction. ∎

Let  us  investigate  the  implications  of  this  theorem  for  convex  polyhedra.  We 
recall  that  a  convex  polyhedron  is  a  bounded  polytope.  Being  the  intersection  of 
closed  half  spaces,  a  convex  polyhedron  is  also  closed.  Thus  any  convex  polyhedron 
is  the  closed  convex  hull  of  its  extreme  points.  It  can  be  shown  (see  Sect.  2.5)  that 
any  polytope  has  at  most  a  finite  number  of  extreme  points  and  hence  a  convex 
polyhedron  is  equal  to  the  convex  hull  of  a  finite  number  of  points.  The  converse 
can  also  be  established,  yielding  the  following  two  equivalent  characterizations. 

Theorem 5. A convex polyhedron can be described either as a bounded intersection of a
finite number of closed half spaces, or as the convex hull of a finite number of points.


Appendix  C 

Gaussian  Elimination 


This  appendix  describes  the  method  for  solving  systems  of  linear  equations  that  has 
proved  to  be,  not  only  the  most  popular,  but  also  the  fastest  and  least  susceptible 
to  round-off  error  accumulation — the  method  of  Gaussian  elimination.  Attention  is 
directed  toward  explaining  this  classical  elimination  technique  itself  and  its  relation 
to  the  theory  of  LU  decomposition  of  a  non-singular  square  matrix. 

We first note how easily triangular systems of equations can be solved. Thus the
system

$$\begin{aligned} a_{11}x_1 &= b_1 \\ a_{21}x_1 + a_{22}x_2 &= b_2 \\ &\ \,\vdots \\ a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n &= b_n \end{aligned}$$

can be solved recursively as follows:

$$\begin{aligned} x_1 &= b_1/a_{11} \\ x_2 &= (b_2 - a_{21}x_1)/a_{22} \\ &\ \,\vdots \\ x_n &= (b_n - a_{n1}x_1 - a_{n2}x_2 - \cdots - a_{n,n-1}x_{n-1})/a_{nn}, \end{aligned}$$

provided that each of the diagonal terms $a_{ii}$, $i = 1, 2, \ldots, n$ is nonzero (as they
must be if the system is nonsingular). This observation motivates us to attempt to
reduce an arbitrary system of equations to a triangular one.
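
The recursion above translates almost line for line into code. The following is a minimal sketch; the function name and the sample system are illustrative assumptions.

```python
def forward_substitution(a, b):
    """Solve a lower triangular system a x = b by the recursion in the text.

    a is a list of rows; only the entries a[i][j] with j <= i are used.
    """
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        if a[i][i] == 0:
            raise ZeroDivisionError("zero diagonal term: the system is singular")
        partial = sum(a[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - partial) / a[i][i]
    return x

# Hypothetical 3 x 3 lower triangular system.
a = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, -1.0, 5.0]]
b = [4.0, 5.0, 16.0]

x = forward_substitution(a, b)
print(x)   # approximately [2.0, 1.0, 1.8]
```

Each unknown is obtained from those already computed, so the whole solution costs on the order of $n^2$ arithmetic operations.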

Definition. A square matrix $\mathbf{C} = [c_{ij}]$ is said to be lower triangular if $c_{ij} = 0$ for
$i < j$. Similarly, $\mathbf{C}$ is said to be upper triangular if $c_{ij} = 0$ for $i > j$.

In matrix notation, the idea of Gaussian elimination is to somehow find a
decomposition of a given $n \times n$ matrix $\mathbf{A}$ in the form $\mathbf{A} = \mathbf{L}\mathbf{U}$ where $\mathbf{L}$ is a lower triangular
and $\mathbf{U}$ an upper triangular matrix. The system

$$\mathbf{A}\mathbf{x} = \mathbf{b} \tag{C.1}$$

can then be solved by solving the two triangular systems

$$\mathbf{L}\mathbf{y} = \mathbf{b}, \quad \mathbf{U}\mathbf{x} = \mathbf{y}. \tag{C.2}$$

The calculation of $\mathbf{L}$ and $\mathbf{U}$ together with solution of the first of these systems is
usually referred to as forward elimination, while solution of the second triangular
system is called back substitution.

Every nonsingular square matrix $\mathbf{A}$ has an LU decomposition, provided that
interchanges of rows of $\mathbf{A}$ are introduced if necessary. This interchange of rows
corresponds to a simple reordering of the system of equations, and hence amounts to
no loss of generality in the method. For simplicity of notation, however, we assume
that no such interchanges are required.

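We turn now to the problem of explicitly determining $\mathbf{L}$ and $\mathbf{U}$, by elimination,
for a nonsingular matrix $\mathbf{A}$. Given the system, we attempt to transform it so that
zeros appear below the main diagonal. Assuming that $a_{11} \ne 0$ we subtract multiples
of the first equation from each of the others in order to get zeros in the first column
below $a_{11}$. If we define $m_{k1} = a_{k1}/a_{11}$ and let

$$\mathbf{M}_1 = \begin{bmatrix} 1 & & & & \\ -m_{21} & 1 & & & \\ -m_{31} & & 1 & & \\ \vdots & & & \ddots & \\ -m_{n1} & & & & 1 \end{bmatrix},$$

the resulting new system of equations can be expressed as

$$\mathbf{A}^{(2)}\mathbf{x} = \mathbf{b}^{(2)}$$

with

$$\mathbf{A}^{(2)} = \mathbf{M}_1\mathbf{A}, \quad \mathbf{b}^{(2)} = \mathbf{M}_1\mathbf{b}.$$

The matrix $\mathbf{A}^{(2)} = [a_{ij}^{(2)}]$ has $a_{k1}^{(2)} = 0$, $k > 1$.

Next, assuming $a_{22}^{(2)} \ne 0$, multiples of the second equation of the new system
are subtracted from equations 3 through $n$ to yield zeros below $a_{22}^{(2)}$ in the second
column. This is equivalent to premultiplying $\mathbf{A}^{(2)}$ and $\mathbf{b}^{(2)}$ by

$$\mathbf{M}_2 = \begin{bmatrix} 1 & & & & \\ 0 & 1 & & & \\ 0 & -m_{32} & 1 & & \\ 0 & -m_{42} & & \ddots & \\ \vdots & \vdots & & & \\ 0 & -m_{n2} & & & 1 \end{bmatrix},$$

where $m_{k2} = a_{k2}^{(2)}/a_{22}^{(2)}$. This yields $\mathbf{A}^{(3)} = \mathbf{M}_2\mathbf{A}^{(2)}$ and $\mathbf{b}^{(3)} = \mathbf{M}_2\mathbf{b}^{(2)}$.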
Proceeding in this way we obtain $A^{(n)} = M_{n-1} M_{n-2} \cdots M_1 A$, an upper triangular
matrix which we denote by U. The matrix $M = M_{n-1} M_{n-2} \cdots M_1$ is a lower triangular
matrix, and since MA = U we have $A = M^{-1} U$. The matrix $L = M^{-1}$ is also
lower triangular and becomes the L of the desired LU decomposition for A.

The representation for L can be made more explicit by noting that $M_k^{-1}$ is the
same as $M_k$ except that the off-diagonal terms have the opposite sign. Furthermore,
we have $L = M^{-1} = M_1^{-1} M_2^{-1} \cdots M_{n-1}^{-1}$, which is easily verified to be

$$
L = \begin{bmatrix}
1 & & & & \\
m_{21} & 1 & & & \\
m_{31} & m_{32} & 1 & & \\
\vdots & \vdots & & \ddots & \\
m_{n1} & m_{n2} & \cdots & & 1
\end{bmatrix}.
$$


Hence L can be evaluated directly in terms of the calculations required by the
elimination process. Of course, an explicit representation for $M = L^{-1}$ would actually
be more useful, but a simple representation for M does not exist. Thus we content
ourselves with the explicit representation for L and use it in (C.2).

If the original system (C.1) is to be solved for a single b vector, the vector y
satisfying Ly = b is usually calculated simultaneously with L in the form
$y = b^{(n)} = Mb$. The final solution x is then found by a single back substitution, from
Ux = y. Once the LU decomposition of A has been obtained, however, the solution
corresponding to any right-hand side can be found by solving the two systems (C.2).
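
The elimination steps above can be sketched in Python. This is only a minimal illustration under the text's assumption that no row interchanges are required; the function names `lu_decompose` and `solve_lu` and the 3 by 3 example matrix are illustrative choices, not part of the text.

```python
def lu_decompose(A):
    """Return (L, U) with A = L U, using elimination without row interchanges."""
    n = len(A)
    U = [row[:] for row in A]          # working copy; ends up upper triangular
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n - 1):
        if U[k][k] == 0:
            raise ZeroDivisionError("zero pivot; a row interchange is needed")
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]      # multiplier m_ik
            L[i][k] = m                # L stores the multipliers directly
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
    return L, U


def solve_lu(L, U, b):
    """Solve L y = b by forward substitution, then U x = y by back substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x


# Hypothetical example data.
A = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
L, U = lu_decompose(A)
x = solve_lu(L, U, [4.0, 10.0, 24.0])
```

Here L holds the multipliers below its unit diagonal, and once L and U are available `solve_lu` can be reused for any right-hand side.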

In practice, the diagonal element $a_{kk}^{(k)}$ of $A^{(k)}$ may become zero or very close
to zero. In this case it is important that the kth row be interchanged with a row
that is below it. Indeed, for considerations of numerical accuracy, it is desirable to
continuously introduce row interchanges of this type in such a way as to ensure $|m_{ij}| \leq 1$
for all i, j. If this is done, the Gaussian elimination procedure has exceptionally
good stability properties.
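
The row-interchange strategy just described is commonly called partial pivoting (a name not used in the text). The following is a minimal sketch, assuming that at each stage the row with the largest absolute entry in the current column is moved into the pivot position. The names `lu_partial_pivot` and `perm` and the example matrix are hypothetical.

```python
def lu_partial_pivot(A):
    """Return (perm, L, U) such that row perm[i] of A equals row i of L U."""
    n = len(A)
    U = [row[:] for row in A]
    L = [[0.0] * n for _ in range(n)]
    perm = list(range(n))              # records the row interchanges
    for k in range(n - 1):
        # pick the row at or below k with the largest entry in column k
        p = max(range(k, n), key=lambda r: abs(U[r][k]))
        if U[p][k] == 0:
            raise ValueError("matrix is singular")
        if p != k:
            U[k], U[p] = U[p], U[k]
            L[k], L[p] = L[p], L[k]    # carry along multipliers already stored
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]      # |m| <= 1 by the choice of pivot row
            L[i][k] = m
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
    for i in range(n):
        L[i][i] = 1.0
    return perm, L, U


# Hypothetical example: the first diagonal element is zero, so an
# interchange is unavoidable at the first stage.
A = [[0.0, 2.0, 1.0],
     [1.0, 1.0, 0.0],
     [2.0, 1.0, 1.0]]
perm, L, U = lu_partial_pivot(A)
```

The list `perm` records the reordering of the equations, so the same reordering must be applied to the right-hand side b before the two triangular systems are solved.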


Appendix  D 

Basic  Network  Concepts 


This  appendix  describes  some  of  the  basic  graph  and  network  terminology  and 
concepts  necessary  for  the  development  of  this  alternative  approach. 

A graph consists of a finite collection of elements called nodes together with a
subset of unordered pairs of the nodes called arcs. The nodes of a graph are usually
numbered, say, 1, 2, 3, ..., n. An arc between nodes i and j is then represented by
the unordered pair (i, j). A graph is typically represented as shown in Fig. D.1. The
nodes are designated by circles, with the number inside each circle denoting the
index of that node. The arcs are represented by the lines between the nodes.


Fig. D.1 A graph


There are a number of other elementary definitions associated with graphs that
are useful in describing their structure. A chain between nodes i and j is a sequence
of arcs connecting them. The sequence must have the form
$(i, k_1), (k_1, k_2), (k_2, k_3), \ldots, (k_m, j)$. In Fig. D.1, (1, 2), (2, 4), (4, 3) is a
chain between nodes 1 and 3. If a direction of movement along a chain is specified,
say from node i to node j, it is then called a path from i to j. A cycle is a chain
leading from node i back to node i. The chain (1, 2), (2, 4), (4, 3), (3, 1) is a cycle
for the graph in Fig. D.1.


©  Springer  International  Publishing  Switzerland  2016 

D.G.  Luenberger,  Y.  Ye,  Linear  and  Nonlinear  Programming ,  International 

Series  in  Operations  Research  &  Management  Science  228, 

DOI  10.1007/978-3-319-18842-3 




A graph is connected if there is a chain between any two nodes. Thus, the graph
of Fig. D.1 is connected. A graph is a tree if it is connected and has no cycles.
Removal of any one of the arcs (1, 2), (1, 3), (2, 4), (3, 4) would transform the graph
of Fig. D.1 into a tree. Sometimes we consider a tree within a graph G, which is just
a tree made up of a subset of arcs from G. Such a tree is a spanning tree if it touches
all nodes of G. It is easy to see that a graph is connected if and only if it contains a
spanning tree.
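
These definitions can be checked mechanically. The sketch below is illustrative only: the function names are hypothetical, and the example graph is assumed to consist of four nodes and exactly the four arcs named in the text, which may not match every detail of Fig. D.1. It also relies on the standard fact that a connected graph on n nodes (without repeated arcs or self-loops) has no cycles exactly when it has n - 1 arcs.

```python
def is_connected(nodes, arcs):
    """True if every pair of nodes is joined by a chain of undirected arcs."""
    nodes = list(nodes)
    if not nodes:
        return True
    neighbors = {v: set() for v in nodes}
    for i, j in arcs:
        neighbors[i].add(j)
        neighbors[j].add(i)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        v = stack.pop()
        for w in neighbors[v] - seen:  # neighbors not yet visited
            seen.add(w)
            stack.append(w)
    return len(seen) == len(nodes)


def is_tree(nodes, arcs):
    """True if the graph is connected and has no cycles."""
    nodes = list(nodes)
    return is_connected(nodes, arcs) and len(arcs) == len(nodes) - 1


# Assumed example graph built from the arcs named in the text.
nodes = [1, 2, 3, 4]
arcs = [(1, 2), (1, 3), (2, 4), (3, 4)]
graph_is_tree = is_tree(nodes, arcs)
trees_after_removal = [is_tree(nodes, [a for a in arcs if a != r]) for r in arcs]
```

Under these assumptions the full graph is connected but is not a tree, while dropping any single arc leaves a spanning tree.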

In directed graphs a sense of orientation is given to each arc. In this case an
arc is considered to be an ordered pair of nodes (i, j), and we say that the arc is
from node i to node j. This is indicated on the graph by having an arrow on the arc
pointing from i to j as shown in Fig. D.2. When working with directed graphs, some
node pairs may have an arc in both directions between them. Rather than explicitly
indicating both arcs in such a case, it is customary to indicate a single undirected
arc. The notions of paths and cycles can be directly applied to directed graphs. In
addition we say that node j is reachable from i if there is a path from node i to j.

In addition to the visual representation of a directed graph characterized by
Fig. D.2, another common method of representation is in terms of a graph's node-arc
incidence matrix. This is constructed by listing the nodes vertically and the arcs
horizontally. Then in the column under arc (i, j), a +1 is placed in the position
corresponding to node i and a -1 is placed in the position corresponding to node j. The
incidence matrix for the graph of Fig. D.2 is shown in Table D.1.


Fig.  D.2  A  directed  graph 


Node   (1,2)  (1,4)  (2,3)  (2,4)  (4,2)
 1       1      1      0      0      0
 2      -1      0      1      1     -1
 3       0      0     -1      0      0
 4       0     -1      0     -1      1

Table D.1 Incidence matrix for example

Clearly, all information about the structure of the graph is contained in the node-arc
incidence matrix. This representation is often very useful for computational purposes,
since it is easily stored in a computer.
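
As an illustration of how such a matrix might be stored, here is a short sketch that builds the node-arc incidence matrix from a list of directed arcs. The function name `incidence_matrix` and the list-of-lists storage are illustrative assumptions; the arc list is the one given for the example table.

```python
def incidence_matrix(num_nodes, arcs):
    """Rows are nodes 1..num_nodes; columns are the arcs in the order given."""
    A = [[0] * len(arcs) for _ in range(num_nodes)]
    for col, (i, j) in enumerate(arcs):
        A[i - 1][col] = 1      # the arc leaves node i
        A[j - 1][col] = -1     # the arc enters node j
    return A


arcs = [(1, 2), (1, 4), (2, 3), (2, 4), (4, 2)]
A = incidence_matrix(4, arcs)
for row in A:
    print(row)
```

Each column contains exactly one +1 and one -1, so the columns sum to zero.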


D.1 Flows in Networks

A  graph  is  an  effective  way  to  represent  the  communication  structure  between  nodes. 
When  there  is  the  possibility  of  flow  along  the  arcs,  we  refer  to  the  directed  graph  as 
a  network.  In  applications  the  network  might  represent  a  transportation  system  or  a 
communication  network,  or  it  may  simply  be  a  representation  used  for  mathematical 
purposes  (such  as  in  the  assignment  problem). 

A flow in a given directed arc (i, j) is a number $x_{ij} \geq 0$. Flows in the arcs of
the network must jointly satisfy a conservation criterion at each node. Specifically,
unless the node is a source or sink as discussed below, flow cannot be created or lost
at a node; the total flow into a node must equal the total flow out of the node. Thus
at each such node i

$$\sum_{j=1}^{n} x_{ij} - \sum_{k=1}^{n} x_{ki} = 0.$$

The first sum is the total flow from i, and the second sum is the total flow to i.
(Of course $x_{ij}$ does not exist if there is no arc from i to j.) It should be clear that
for nonzero flows to exist in a network without sources or sinks, the network must
contain a cycle.

In  many  applications,  some  nodes  are  in  fact  designated  as  sources  or  sinks  (or, 
alternatively,  supply  nodes  or  demand  nodes).  The  net  flow  out  of  a  source  may  be 
positive,  and  the  level  of  this  net  flow  may  either  be  fixed  or  variable,  depending  on 
the  application.  Similarly,  the  net  flow  into  a  sink  may  be  positive. 
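
The conservation criterion can be illustrated with a small sketch that computes the net flow out of every node. The function name `net_outflow`, the dictionary representation of flows, and the flow values are hypothetical; the arcs are taken from the directed-graph example above.

```python
def net_outflow(num_nodes, flows):
    """flows maps a directed arc (i, j) to its flow x_ij >= 0.

    Returns, for each node, total flow out minus total flow in.
    """
    net = [0] * num_nodes
    for (i, j), x in flows.items():
        if x < 0:
            raise ValueError("flows must be nonnegative")
        net[i - 1] += x        # adds to the flow out of node i
        net[j - 1] -= x        # adds to the flow into node j
    return net


# A circulation around the cycle 2 -> 4 -> 2: no sources or sinks needed.
circulation = {(2, 4): 5, (4, 2): 5}
net_circulation = net_outflow(4, circulation)

# Flow sent along the path 1 -> 2 -> 3: node 1 acts as a source, node 3 as a sink.
shipment = {(1, 2): 3, (2, 3): 3}
net_shipment = net_outflow(4, shipment)
```

A zero entry means conservation holds at that node; a positive entry marks a source and a negative entry a sink.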


D.2  Tree  Procedure 

Recall that node j is reachable from node i in a directed graph if there is a path
from node i to node j. For simple graphs, determination of reachability can be
accomplished by inspection, but for large graphs it generally cannot. The problem can
be solved systematically by a process of repeatedly labeling and scanning various
nodes in the graph. This procedure is the backbone of a number of methods for solving
more complex graph and network problems, as illustrated later. It can also be
used to establish quickly some important theoretical results.

Assume that we wish to determine whether a path from node 1 to node m exists.
At each step of the algorithm, each node is either unlabeled, labeled but unscanned,
or labeled and scanned. The procedure consists of these steps:

Step 1. Label node 1 with any mark. All other nodes are unlabeled.

Step 2. For any labeled but unscanned node i, scan the node by finding all unlabeled
nodes reachable from i by a single arc. Label these nodes with an i.

Step 3. If node m is labeled, stop; a breakthrough has been achieved and a path
exists. If no unlabeled nodes can be labeled, stop; no connecting path exists.
Otherwise, go to Step 2.

The process is illustrated in Fig. D.3, where a path between nodes 1 and 10 is
sought. The nodes have been labeled and scanned in the order 1, 2, 3, 5, 6, 8, 4, 7,
9, 10. The labels are indicated close to the nodes. The arcs that were used in the
scanning processes are indicated by heavy lines. Note that the collection of nodes
and arcs selected by the process, regarded as an undirected graph, forms a tree, that
is, a graph without cycles. This, of course, accounts for the name of the process, the
tree procedure. If one is interested only in determining whether a connecting path
exists and does not need to find the path itself, then the labels need only be simple
check marks rather than node indices. However, if node indices are used as labels,
then after successful completion of the algorithm, the actual connecting path can be
found by tracing backward from node m by following the labels. In the example,
one begins at 10 and moves to node 7 as indicated; then to 6, 3, and 1. The path
follows the reverse of this sequence.
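
The labeling, scanning, and backward tracing described above might be sketched as follows. This is one possible realization, not the only one: the text allows any labeled but unscanned node to be scanned next, and the sketch assumes a first-in, first-out order. The function name `tree_procedure` is hypothetical, and since the arcs of Fig. D.3 are not listed in the text, the example reuses the arcs of the directed graph from Table D.1.

```python
from collections import deque


def tree_procedure(arcs, start, target):
    """Return a path from start to target as a list of nodes, or None."""
    successors = {}
    for i, j in arcs:
        successors.setdefault(i, []).append(j)

    label = {start: None}              # Step 1: label the starting node
    unscanned = deque([start])         # labeled but not yet scanned
    while unscanned:
        i = unscanned.popleft()        # Step 2: scan node i
        for j in successors.get(i, []):
            if j not in label:
                label[j] = i           # label j with the node it came from
                unscanned.append(j)
        if target in label:            # Step 3: breakthrough
            path = [target]
            while label[path[-1]] is not None:
                path.append(label[path[-1]])   # trace the labels backward
            return path[::-1]
    return None                        # nothing left to scan: no path exists


arcs = [(1, 2), (1, 4), (2, 3), (2, 4), (4, 2)]
path_1_to_3 = tree_procedure(arcs, 1, 3)
path_3_to_1 = tree_procedure(arcs, 3, 1)
```

Because each node is labeled at most once, the labels define a tree rooted at the starting node, and following them backward from the target recovers the path.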

It is easy to prove that the algorithm does indeed resolve the issue of the existence
of a connecting path. At each stage of the process, either a new node is labeled,
it is impossible to continue, or node m is labeled and the process is successfully
terminated. Clearly, the process can continue for at most $n - 1$ stages, where n is the
number of nodes in the graph. Suppose at some stage it is impossible to continue.
Let S be the set of labeled nodes at that stage and let $\bar{S}$ be the set of unlabeled nodes.
Clearly, node 1 is contained in S, and node m is contained in $\bar{S}$. If there were a path
connecting node 1 with node m, then there must be an arc in that path from a node k
in S to a node in $\bar{S}$. However, this would imply that node k was not scanned, which
is a contradiction. Conversely, if the algorithm does continue until reaching node
m, then it is clear that a connecting path can be constructed backward as outlined
above.


Fig.  D.3  The  scanning  procedure 


D.3 Capacitated Networks

In some network applications it is useful to assume that there are upper bounds
on the allowable flow in various arcs. This motivates the concept of a capacitated
network. A capacitated network is a network in which some arcs are assigned
nonnegative capacities, which define the maximum allowable flow in those arcs. The
capacity of an arc (i, j) is denoted $k_{ij}$, and this capacity is indicated on the graph by
placing the number $k_{ij}$ adjacent to the arc. Figure 2.1 shows an example of a network
with the capacities indicated. Thus the capacity from node 1 to node 2 is 12, while
that from node 2 to node 1 is 6.

